[Bugfix] Open Rocm ci test and fix some bugs. #1443
Conversation
📝 Walkthrough

Adds HIP/ROCm FP8/MFMA and warp-reduce support, refactors HIP atomics and debug printing into trait/util namespaces, extends HIP codegen (ptx_cp_async, warp_reduce intrinsics), updates JIT wrapper/type maps and engine include-path resolution, and applies broad test gating/accum-dtype updates.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Dev as Developer / CI
    participant JIT as JIT Wrapper
    participant Lower as Engine Lower
    participant Codegen as HIP Codegen
    participant Device as GPU
    Note over Dev,JIT: submit kernel source / tests
    Dev->>JIT: provide kernel source
    JIT->>JIT: extract declaration (get_declaration) %% new CUDA/HIP paths
    JIT->>Lower: send declaration + type map (incl. FP8)
    Lower->>Lower: resolve include paths (tilelang.env) & MFMA/FP8 mapping
    Lower->>Codegen: request HIP emission (ptx_cp_async, warp_reduce, atomics)
    Codegen->>Codegen: generate HIP kernel using tl::shfl / tl::warp_reduce / atomics
    Codegen->>JIT: return emitted kernel source
    JIT->>Device: compile & launch
    Device-->>Dev: runtime results / tests
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
testing/python/language/test_tilelang_language_atomic_add.py (1)
210-223: Fix critical dtype mismatch in tensor creation. The function accepts a `dtype` parameter and passes it to the kernel (line 211), but test tensors are always created with `torch.float32` (lines 214-215). This causes a type mismatch when the kernel expects a different dtype (e.g., `T.float16`). All other atomic test functions in this file correctly use `dtype=getattr(torch, dtype)` when creating test tensors.

🔎 Proposed fix

```diff
- A = torch.randn(M, N, dtype=torch.float32).cuda()
- B = torch.zeros(M, N, dtype=torch.float32).cuda()
+ A = torch.randn(M, N, dtype=getattr(torch, dtype)).cuda()
+ B = torch.zeros(M, N, dtype=getattr(torch, dtype)).cuda()
```

testing/python/language/test_tilelang_language_annotate_safe_value.py (1)
31-33: Inconsistent with broader PR pattern for target selection. This file still uses `target="cuda"` while other test files in this PR were updated to use `target="auto"` to enable automatic backend selection. For consistency with the multi-backend support objectives, consider updating this to:

```diff
- kernel = tilelang.compile(
-     program, out_idx=[1], target="cuda", pass_configs={"tl.disable_warp_specialized": True, "tl.disable_tma_lower": True}
- )
+ kernel = tilelang.compile(
+     program, out_idx=[1], target="auto", pass_configs={"tl.disable_warp_specialized": True, "tl.disable_tma_lower": True}
+ )
```

Files that were updated in this PR include:

- `test_tilelang_language_mask_op.py` (lines 32, 68, 105, 141)
- `test_tilelang_language_unroll.py` (lines 16, 32)

testing/python/language/test_tilelang_language_mask_op.py (1)
31-36: Add `@tilelang.testing.requires_cuda` decorators to test functions. The compilation uses `target="auto"`, which auto-detects CUDA/HIP/Metal based on availability (per `tilelang/utils/target.py`). However, all four test functions hardcode `device="cuda"` for tensor creation, creating a device mismatch if `target="auto"` selects HIP or Metal on a system with multiple backends available.

Add decorators to the following functions (a minimal usage sketch follows this list):

- `test_tilelang_copy_mask_parallel` (line 39)
- `test_tilelang_copy_mask_copy` (line 75)
- `test_tilelang_copy_mask_parallel_range` (line 112)
- `test_tilelang_copy_mask_copy_range` (line 148)

This matches the pattern used in other test files in the same directory and ensures tests skip gracefully on non-CUDA systems.
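For reference, a minimal illustration of the suggested gating; the decorator name comes from the codebase, and the empty body is a placeholder:

```python
import tilelang.testing

@tilelang.testing.requires_cuda  # test is skipped automatically when no CUDA device is available
def test_tilelang_copy_mask_parallel():
    ...  # existing test body unchanged
```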
testing/python/debug/test_tilelang_debug_print.py (1)
8-17: Inconsistent target specification across helper functions. The `debug_print_buffer` function (line 15) doesn't specify `target="auto"`, while other helper functions (`debug_print_buffer_conditional`, `debug_print_value_conditional`, etc.) explicitly use `target="auto"`. This inconsistency could cause the parameterized `test_debug_print_buffer` to fail on ROCm-only systems.

🔎 Suggested fix

```diff
 def debug_print_buffer(M=16, N=16, dtype=T.float16):
     @T.prim_func
     def program(Q: T.Tensor((M, N), dtype)):
         with T.Kernel(4, 4, 2, threads=128 * 2) as (bx, by, bz):
             shared_buf = T.alloc_shared([M, N], dtype)
             T.print(shared_buf)

-    jit_kernel = tilelang.compile(program)
+    jit_kernel = tilelang.compile(program, target="auto")
     profiler = jit_kernel.get_profiler()
     profiler.run_once()
```
🧹 Nitpick comments (7)
testing/python/jit/test_tilelang_jit_tvm_ffi.py (1)
386-386: LGTM! Properly gates CUDA-specific functionality. The decorator correctly ensures this test only runs when CUDA is available, preventing failures on systems without CUDA support (particularly relevant given the PR's ROCm multi-backend focus).

Optional: Consider applying to other CUDA-dependent tests

Several other test functions in this file also use `.cuda()` calls without the decorator:

- `test_gemm_jit_kernel` (lines 137-138)
- `test_tvm_ffi_kernel_multi_stream` (line 246)
- `test_tvm_ffi_dynamic_shape` (lines 285-286)

If these tests haven't been updated elsewhere in the PR, consider applying the same decorator for consistency.
testing/python/jit/test_tilelang_jit_callback.py (1)
117-117: Note: Accumulation dtype change has no effect (test is skipped). The change from `T.float16` to `T.float32` accumulation dtype occurs in `test_cuda_postproc_callback`, which is skipped (line 107). This change will have no runtime effect until the test is re-enabled.

testing/python/language/test_tilelang_language_annot.py (1)
7-8: Track the cython backend bug for future resolution. The TODO comments indicate a known bug affecting these tests with the cython backend, and the `@requires_cuda` decorators effectively disable them in certain configurations. This creates technical debt.

Do you want me to open a tracking issue for the cython backend build bug to ensure it's addressed in a future PR?
Also applies to: 31-32, 55-56
testing/python/kernel/test_tilelang_kernel_gemm.py (1)
237-252: Track the ROCm precision issue for f32f32f32_tn GEMM. The TODO comment indicates a known precision issue with this specific GEMM configuration (transposed A, float32) on ROCm. This creates technical debt that should be tracked.
Do you want me to open a tracking issue for the ROCm precision problem with the f32f32f32_tn GEMM configuration to ensure it's investigated in a future PR?
tilelang/engine/lower.py (1)
118-120: Remove unused local variable assignment. Line 119 assigns `tl_template_path` but never uses it. This appears to be leftover code from the refactoring to use the centralized `TILELANG_TEMPLATE_PATH` constant.

🔎 Proposed cleanup

```diff
 @tvm_ffi.register_global_func("tilelang_callback_hip_compile", override=True)
 def tilelang_callback_hip_compile(code, target):
-    project_root = osp.join(osp.dirname(__file__), "../..")
-    tl_template_path = osp.abspath(osp.join(project_root, "src"))
-    hsaco = hipcc.compile_hip(
```

testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py (1)
434-436: Track ROCm precision issues. Multiple TODO comments indicate precision problems on ROCm for specific test configurations. These disabled test cases reduce coverage for the ROCm backend.
Would you like me to help create tracking issues for these ROCm precision problems to ensure they're addressed systematically?
Also applies to: 441-443, 466-468, 598-600, 631-633
src/tl_templates/hip/debug.h (1)
313-313: Inconsistent indentation: tab character instead of spaces. Line 313 uses a tab character for indentation while the rest of the file uses spaces.

🔎 Proposed fix

```diff
-	PrintTraits<T>::print_buffer(msg, buf_name, index, var);
+  PrintTraits<T>::print_buffer(msg, buf_name, index, var);
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (52)
- `.github/workflows/ci.yml`
- `src/op/gemm.cc`
- `src/op/logical.cc`
- `src/target/codegen_hip.cc`
- `src/tl_templates/cuda/reduce.h`
- `src/tl_templates/hip/atomic.h`
- `src/tl_templates/hip/common.h`
- `src/tl_templates/hip/debug.h`
- `src/tl_templates/hip/hip_fp8.h`
- `src/tl_templates/hip/reduce.h`
- `testing/python/autotune/test_tilelang_autotune.py`
- `testing/python/carver/test_tilelang_carver_cuda_driver_properties.py`
- `testing/python/carver/test_tilelang_carver_recommend_hints.py`
- `testing/python/components/test_storage_rewrite_detect_inplace.py`
- `testing/python/debug/test_device_assert.py`
- `testing/python/debug/test_tilelang_debug_print.py`
- `testing/python/issue/test_tilelang_issue_1001.py`
- `testing/python/issue/test_tilelang_issue_1008.py`
- `testing/python/issue/test_tilelang_issue_830.py`
- `testing/python/issue/test_tilelang_issue_96.py`
- `testing/python/jit/test_tilelang_jit_callback.py`
- `testing/python/jit/test_tilelang_jit_cutedsl.py`
- `testing/python/jit/test_tilelang_jit_gemm.py`
- `testing/python/jit/test_tilelang_jit_gemm_cython.py`
- `testing/python/jit/test_tilelang_jit_nvrtc.py`
- `testing/python/jit/test_tilelang_jit_parcompile.py`
- `testing/python/jit/test_tilelang_jit_tvm_ffi.py`
- `testing/python/kernel/test_tilelang_kernel_gemm.py`
- `testing/python/kernel/test_tilelang_kernel_gemm_simt.py`
- `testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py`
- `testing/python/language/test_tilelang_language_alias.py`
- `testing/python/language/test_tilelang_language_alloc.py`
- `testing/python/language/test_tilelang_language_annot.py`
- `testing/python/language/test_tilelang_language_annotate_safe_value.py`
- `testing/python/language/test_tilelang_language_atomic_add.py`
- `testing/python/language/test_tilelang_language_clear.py`
- `testing/python/language/test_tilelang_language_composable_index.py`
- `testing/python/language/test_tilelang_language_copy.py`
- `testing/python/language/test_tilelang_language_frontend_v2.py`
- `testing/python/language/test_tilelang_language_infinity.py`
- `testing/python/language/test_tilelang_language_let.py`
- `testing/python/language/test_tilelang_language_mask_op.py`
- `testing/python/language/test_tilelang_language_ptr.py`
- `testing/python/language/test_tilelang_language_unroll.py`
- `testing/python/language/test_tilelang_language_var_init.py`
- `testing/python/language/test_tilelang_language_vectorized_cast.py`
- `testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py`
- `testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py`
- `tilelang/engine/lower.py`
- `tilelang/intrinsics/mfma_layout.py`
- `tilelang/intrinsics/mfma_macro_generator.py`
- `tilelang/jit/adapter/wrapper.py`
💤 Files with no reviewable changes (1)
- testing/python/issue/test_tilelang_issue_830.py
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.
Applied to files:
testing/python/language/test_tilelang_language_let.py, testing/python/language/test_tilelang_language_mask_op.py, testing/python/language/test_tilelang_language_vectorized_cast.py, tilelang/engine/lower.py, testing/python/language/test_tilelang_language_copy.py, testing/python/language/test_tilelang_language_ptr.py, testing/python/jit/test_tilelang_jit_nvrtc.py, testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py, testing/python/issue/test_tilelang_issue_96.py, testing/python/autotune/test_tilelang_autotune.py, testing/python/issue/test_tilelang_issue_1001.py, testing/python/language/test_tilelang_language_annotate_safe_value.py, testing/python/language/test_tilelang_language_frontend_v2.py, testing/python/jit/test_tilelang_jit_tvm_ffi.py, testing/python/language/test_tilelang_language_alias.py, testing/python/kernel/test_tilelang_kernel_gemm_simt.py, testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py, testing/python/language/test_tilelang_language_annot.py, testing/python/jit/test_tilelang_jit_parcompile.py, testing/python/language/test_tilelang_language_alloc.py, testing/python/kernel/test_tilelang_kernel_gemm.py, testing/python/issue/test_tilelang_issue_1008.py, testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py, testing/python/language/test_tilelang_language_var_init.py, testing/python/jit/test_tilelang_jit_cutedsl.py, testing/python/language/test_tilelang_language_unroll.py, testing/python/language/test_tilelang_language_clear.py, testing/python/carver/test_tilelang_carver_cuda_driver_properties.py, testing/python/jit/test_tilelang_jit_gemm.py, testing/python/carver/test_tilelang_carver_recommend_hints.py, testing/python/jit/test_tilelang_jit_callback.py
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.
Applied to files:
testing/python/language/test_tilelang_language_mask_op.py, testing/python/language/test_tilelang_language_ptr.py, testing/python/jit/test_tilelang_jit_nvrtc.py, testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py, testing/python/components/test_storage_rewrite_detect_inplace.py, testing/python/issue/test_tilelang_issue_96.py, testing/python/language/test_tilelang_language_annotate_safe_value.py, testing/python/language/test_tilelang_language_frontend_v2.py, testing/python/language/test_tilelang_language_infinity.py, testing/python/language/test_tilelang_language_alias.py, testing/python/kernel/test_tilelang_kernel_gemm_simt.py, testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py, testing/python/jit/test_tilelang_jit_parcompile.py, testing/python/language/test_tilelang_language_atomic_add.py, testing/python/language/test_tilelang_language_alloc.py, testing/python/kernel/test_tilelang_kernel_gemm.py, testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py, testing/python/jit/test_tilelang_jit_cutedsl.py, testing/python/language/test_tilelang_language_unroll.py, testing/python/language/test_tilelang_language_clear.py
📚 Learning: 2025-12-24T17:20:32.819Z
Learnt from: clouds56
Repo: tile-ai/tilelang PR: 1527
File: tilelang/env.py:0-0
Timestamp: 2025-12-24T17:20:32.819Z
Learning: The nvidia-cuda-nvcc PyPI package installs to `nvidia/cu13/bin/` (for CUDA 13), `nvidia/cu12/bin/` (for CUDA 12), and `nvidia/cu11/bin/` (for CUDA 11) in the site-packages directory, not to `nvidia/cuda_nvcc/bin/`. These paths should be used when detecting CUDA installations from PyPI packages in tilelang/env.py.
Applied to files:
tilelang/engine/lower.py
📚 Learning: 2025-09-15T10:51:06.985Z
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). However, the layout template parameter should use sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h where Crosswise uses sizeof(ElementA) but the layout template uses sizeof_bits<ElementA>::value.
Applied to files:
tilelang/intrinsics/mfma_layout.py
📚 Learning: 2025-09-15T10:51:06.985Z
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h.
Applied to files:
tilelang/intrinsics/mfma_layout.py
📚 Learning: 2025-12-26T06:45:51.789Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1483
File: tilelang/jit/adapter/cutedsl/adapter.py:93-95
Timestamp: 2025-12-26T06:45:51.789Z
Learning: For the CuTeDSL backend in tilelang/jit/adapter/cutedsl/adapter.py, the host_kernel_source and device_kernel_source have the same value.
Applied to files:
testing/python/language/test_tilelang_language_copy.py, testing/python/language/test_tilelang_language_ptr.py, testing/python/jit/test_tilelang_jit_cutedsl.py
🧬 Code graph analysis (23)
src/op/gemm.cc (1)
src/target/utils.cc (2)
TargetIsCDNA (73-82), TargetIsCDNA (73-73)
testing/python/debug/test_device_assert.py (1)
tilelang/language/ast/ir.py (1)
target(1677-1707)
testing/python/language/test_tilelang_language_mask_op.py (1)
tilelang/language/ast/ir.py (1)
target(1677-1707)
testing/python/language/test_tilelang_language_copy.py (1)
tilelang/language/ast/ir.py (1)
target(1677-1707)
testing/python/language/test_tilelang_language_ptr.py (1)
tilelang/language/ast/ir.py (1)
target(1677-1707)
testing/python/components/test_storage_rewrite_detect_inplace.py (3)
tilelang/language/ast/ir.py (1)
target (1677-1707)
tilelang/utils/target.py (1)
check_hip_availability (42-52)
src/target/codegen_hip.cc (2)
pattern (66-69), pattern (66-67)
testing/python/issue/test_tilelang_issue_96.py (3)
tilelang/jit/__init__.py (2)
compile (48-108), compile (353-379)
tilelang/jit/kernel.py (1)
out_idx (609-610)
tilelang/language/ast/ir.py (1)
target(1677-1707)
src/tl_templates/cuda/reduce.h (1)
src/tl_templates/hip/reduce.h (2)
run (121-181), run (188-259)
src/tl_templates/hip/debug.h (4)
tilelang/language/print_op.py (1)
print_var (15-25)
src/tl_templates/cuda/gemm_sp_sm80.h (1)
unsigned (124-126)
tilelang/language/v2/dtypes.py (2)
short (235-235), long (238-238)
src/tl_templates/cuda/reduce.h (2)
half_t (17-19), bfloat16_t (20-22)
testing/python/jit/test_tilelang_jit_tvm_ffi.py (2)
tilelang/language/v2/dtypes.py (2)
float32 (300-300), float16 (299-299)
tilelang/language/symbolics.py (1)
dynamic(10-21)
tilelang/intrinsics/mfma_macro_generator.py (1)
tilelang/language/v2/dtypes.py (1)
int8(243-243)
testing/python/language/test_tilelang_language_alias.py (1)
tilelang/jit/__init__.py (2)
compile (48-108), compile (353-379)
src/tl_templates/hip/common.h (1)
src/tl_templates/cuda/common.h (5)
tl (205-299), tl (528-530), tl (531-533), tl (548-628), bfloat16_t (538-539)
testing/python/jit/test_tilelang_jit_gemm_cython.py (2)
tilelang/language/v2/dtypes.py (2)
float32 (300-300), float16 (299-299)
tilelang/language/symbolics.py (1)
dynamic(10-21)
testing/python/language/test_tilelang_language_atomic_add.py (2)
tilelang/language/v2/dtypes.py (2)
float16 (299-299), float32 (300-300)
tilelang/language/atomic.py (1)
atomic_addx2(226-261)
src/tl_templates/hip/reduce.h (2)
src/tl_templates/cuda/reduce.h (3)
tl (10-83), run (110-171), run (178-250)
src/tl_templates/hip/common.h (13)
tl (110-200), shfl_down (138-140), shfl_down (157-161), shfl_down (182-186), shfl_xor (134-136), shfl_xor (151-155), shfl_xor (176-180), shfl (146-148), shfl (169-173), shfl (194-198), shfl_up (142-144), shfl_up (163-167), shfl_up (188-192)
testing/python/debug/test_tilelang_debug_print.py (1)
tilelang/jit/__init__.py (2)
compile (48-108), compile (353-379)
testing/python/language/test_tilelang_language_unroll.py (1)
tilelang/language/ast/ir.py (1)
target(1677-1707)
testing/python/language/test_tilelang_language_clear.py (2)
tilelang/jit/__init__.py (2)
compile (48-108), compile (353-379)
tilelang/language/ast/ir.py (1)
target(1677-1707)
testing/python/language/test_tilelang_language_composable_index.py (1)
tilelang/language/ast/ir.py (1)
target(1677-1707)
testing/python/carver/test_tilelang_carver_cuda_driver_properties.py (1)
tilelang/carver/arch/driver/cuda_driver.py (7)
get_cuda_device_properties (22-30), get_device_name (33-36), get_shared_memory_per_block (39-52), get_device_attribute (55-78), get_persisting_l2_cache_max_size (97-99), get_num_sms (102-118), get_registers_per_block (121-129)
testing/python/jit/test_tilelang_jit_gemm.py (1)
tilelang/language/v2/dtypes.py (1)
float32(300-300)
testing/python/jit/test_tilelang_jit_callback.py (2)
tilelang/engine/callback.py (2)
register_cuda_postproc_callback (29-59), register_hip_postproc_callback (62-92)
tilelang/language/v2/dtypes.py (1)
float32(300-300)
🔇 Additional comments (66)
testing/python/issue/test_tilelang_issue_1001.py (1)
26-26: LGTM! Correct CUDA gating for device-specific test. The decorator appropriately gates this test behind CUDA availability, which is necessary given the hardcoded `device="cuda"` on line 29.

testing/python/language/test_tilelang_language_infinity.py (2)
3-3: Verify that the `tilelang.testing` module provides the required decorator and entry point. The changes introduce a dependency on the `tilelang.testing.requires_cuda` decorator and the `tilelang.testing.main()` entry point. Per the PR objectives, these are part of the new test infrastructure updates for CUDA requirement decorators. However, the implementation and API of this module is not visible in the provided context.

Confirm the following:

- The `tilelang.testing.requires_cuda` decorator properly skips tests when CUDA is unavailable
- The `tilelang.testing.main()` entry point is compatible with standard test runners (pytest, unittest)
- Test discovery and CI/CD integration still work as expected

Additionally, given that this PR introduces HIP/ROCm support ("rocm_ci"), consider whether the test should use a decorator like `requires_cuda_or_hip` (a hedged sketch of such a gate follows) or run on both backends.

Also applies to: 24-24, 33-33
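A sketch of what such a combined gate could look like; `requires_cuda_or_hip` is hypothetical and not an existing `tilelang.testing` helper. It relies on the fact that PyTorch's ROCm builds expose HIP devices through the `torch.cuda` namespace, so `torch.cuda.is_available()` is true on both backends:

```python
import pytest
import torch

# Hypothetical helper (not part of tilelang.testing): skip unless some GPU backend
# is available. On ROCm builds of PyTorch, torch.cuda.is_available() is also True
# because HIP devices are exposed through the torch.cuda namespace.
requires_cuda_or_hip = pytest.mark.skipif(
    not torch.cuda.is_available(),
    reason="requires a CUDA or ROCm device",
)

@requires_cuda_or_hip
def test_infinity_any_gpu():
    ...  # placeholder body for illustration
```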
17-29: Verify infinity comparison logic across all dtypes, especially float16 and bfloat16. The test compares kernel output directly with `torch.inf` across four dtypes. While this should work for standard IEEE 754 floats, float16 and bfloat16 have reduced precision and may exhibit different behavior.

Confirm (a quick illustrative check follows this list):

- How does `torch.inf == torch.inf` behave for float16 and bfloat16 tensors?
- Are there any dtype promotion issues when comparing against the float32-typed `torch.inf` constant?
- Has this test been validated on actual CUDA devices with all four dtypes?

Additionally, the kernel is called without explicit device selection (line 19). Ensure the implicit device context is correctly set by the `@requires_cuda` decorator.
testing/python/issue/test_tilelang_issue_1008.py (1)

42-42: LGTM! Decorators correctly gate CUDA-specific tests. The `@tilelang.testing.requires_cuda` decorators are appropriate since both tests explicitly use `device="cuda"`. This ensures tests are skipped on systems without CUDA support, aligning with the multi-backend test-gating pattern in this PR.

Also applies to: 49-49
testing/python/jit/test_tilelang_jit_gemm_cython.py (3)
160-174: LGTM! Using float32 for accumulation improves numerical stability. Accumulating in float32 while keeping float16 for input/output is a standard practice that prevents precision loss and potential overflow during GEMM operations.

210-211: Consistent precision improvement applied to benchmarking and multi-stream tests. The float32 accumulator change is correctly applied to both the benchmarking test and multi-stream test functions.
Also applies to: 254-255
303-308: Float32 accumulation consistently applied to all dynamic shape test variants. All dynamic shape tests (with varying combinations of dynamic M, N, K dimensions) now use float32 for accumulation, maintaining consistency with the other test functions.
Also applies to: 356-357
testing/python/jit/test_tilelang_jit_cutedsl.py (1)
100-100: LGTM! CUDA-specific tests correctly gated. The `@tilelang.testing.requires_cuda` decorators appropriately gate these test functions, as they all rely on CUDA-specific PyTorch APIs (`.cuda()` for tensor placement and `torch.cuda.Stream()` for stream management). This aligns with the broader multi-backend infrastructure work.

Also applies to: 221-221, 273-273, 316-316, 366-366
testing/python/language/test_tilelang_language_vectorized_cast.py (1)
78-78: LGTM! Appropriate CUDA requirement gate. The `@tilelang.testing.requires_cuda` decorator is correctly added to gate this test, which creates CUDA tensors and executes CUDA kernels. This is consistent with the other test functions in this file (lines 100, 115) and aligns with the PR's objective of adding multi-backend support.

.github/workflows/ci.yml (1)
406-407: Temporary test exclusion is acceptable for WIP. The ignore flags for `./python/runtime` and `./python/transform` are appropriate for a work-in-progress ROCm implementation. However, ensure these exclusions are addressed and removed before the PR is marked ready for final merge, as they create a test coverage gap compared to the CUDA tests.

testing/python/kernel/test_tilelang_kernel_gemm_simt.py (3)
157-157: Correct dtype alignment for reference computation. The reference result now correctly uses `out_dtype` instead of `accum_dtype`, ensuring the reference matches the actual output dtype of the kernel. This aligns with the pattern used in similar test files (`test_tilelang_kernel_gemm_mma.py` and `test_tilelang_kernel_gemm_int4_gemm_mma.py`).
164-164: Verify the intent of accumulation dtype change. The accumulation dtype for the first test case changed from `T.float16` to `T.float32`, which increases precision. Please confirm whether this change is:
- A correction to fix a previously incorrect test configuration
- An intentional modification to test higher-precision accumulation
- Related to numerical stability or accuracy concerns
168-170: LGTM! CUDA-specific int8 test is properly gated. The new `test_assert_tl_matmul_int8` function is correctly decorated with `@tilelang.testing.requires_cuda` since int8 matmul uses the dp4a instruction (line 119 in `tl_matmul_simt`), which is CUDA-specific. This aligns with the PR's objective to support multi-backend execution with appropriate test gating.

testing/python/language/test_tilelang_language_var_init.py (1)
5-6: LGTM! Appropriate gating for unsupported HIP functionality. The decorator correctly gates this test to CUDA-only execution, preventing failures on HIP/ROCm where variable initialization is not yet supported. The TODO comment clearly documents this limitation for future work.
testing/python/jit/test_tilelang_jit_gemm.py (1)
106-120: LGTM! Improved numerical stability with FP32 accumulation. The function rename accurately reflects the accumulator dtype change, and switching from FP16 to FP32 accumulation significantly improves numerical stability in GEMM operations. Accumulating in higher precision (FP32) while keeping FP16 inputs and output is a best practice that reduces overflow/underflow risks during matrix multiplication.
This change is consistent with the broader pattern across GEMM/JIT tests in this PR.
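For illustration only (shapes and the tolerance-free comparison are arbitrary assumptions, not taken from the tests), the usual mixed-precision reference follows this pattern: fp16 inputs, fp32 accumulation, fp16 output.

```python
import torch

M, N, K = 128, 128, 128
a = torch.randn(M, K, dtype=torch.float16)
b = torch.randn(K, N, dtype=torch.float16)

ref = (a.float() @ b.float()).half()  # accumulate in fp32, store the result as fp16
naive = a @ b                         # fp16 accumulation, more prone to rounding error

print((ref.float() - naive.float()).abs().max())  # small but nonzero gap expected
```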
testing/python/jit/test_tilelang_jit_tvm_ffi.py (1)
167-167: LGTM! Improved numerical stability. Changing the accumulation dtype from `T.float16` to `T.float32` improves numerical stability for GEMM operations while maintaining the memory and bandwidth benefits of float16 for inputs and outputs. This is a standard best practice for mixed-precision computation, especially important for multi-backend support where consistent numerical behavior is desired.

Also applies to: 210-210, 252-252, 301-301, 303-303, 306-306
testing/python/language/test_tilelang_language_atomic_add.py (2)
198-207: LGTM: dtype parameterization added correctly. The dtype parameterization for `atomic_addx2_program` is well-structured and consistent with other atomic operation functions in this file.
238-238: LGTM: CUDA requirement decorators added appropriately. The `@tilelang.testing.requires_cuda` decorators correctly gate CUDA-specific atomic operations, ensuring tests only run on appropriate hardware. This aligns with the PR's multi-backend support objectives.

Also applies to: 243-243, 353-353
testing/python/jit/test_tilelang_jit_parcompile.py (1)
61-62: Verify the block_K configuration values. The test configurations at lines 61-62 use different `block_K` values (64 and 32). Confirm whether these values are intentional for testing different blocking strategies or if they require explanation. If this change is part of ROCm optimization, consider adding a comment to document the rationale.

testing/python/components/test_storage_rewrite_detect_inplace.py (2)
4-6: LGTM! Appropriate use of import-time configuration. The module-level HIP availability check is a reasonable approach for test configuration. The result is cached and reused throughout the test, avoiding repeated checks.
60-62: Backend-specific code generation is expected behavior; consider adding a documentation comment. The test correctly validates that the inplace detection optimization works on both HIP and CUDA backends, but they generate different code patterns:

- HIP: `read[0] = (read[0] * 2);` (array-style indexing)
- CUDA: `read = (read * 2);` (scalar access)

This difference is intentional and stems from how each backend's code generator handles intermediate variables. However, adding a comment explaining why HIP uses array-style indexing while CUDA uses scalar access would improve code clarity for future maintainers.
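A small sketch of how such a documenting assertion might look; the generated strings are taken from the comment above, but the helper function itself is hypothetical:

```python
def check_inplace_codegen(code: str, is_hip: bool) -> None:
    # HIP codegen keeps the intermediate as a one-element array; CUDA emits a plain scalar.
    pattern = "read[0] = (read[0] * 2);" if is_hip else "read = (read * 2);"
    assert pattern in code, f"expected inplace-rewrite pattern not found: {pattern!r}"
```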
testing/python/jit/test_tilelang_jit_callback.py (3)
4-4: LGTM! Import addition supports HIP backend. The import of `register_hip_postproc_callback` mirrors the existing CUDA callback import and enables HIP post-processing support, aligning with the PR's multi-backend objectives.
92-96: LGTM! HIP callback correctly mirrors CUDA implementation. The HIP post-processing callback properly mirrors the CUDA callback structure and will register for the HIP backend. The minor code duplication between the two callbacks is acceptable for test code.
232-232: No action required. The accumulation dtype `T.float32` is consistent with the standard GEMM implementation pattern across the codebase. Nearly all other GEMM tests (CPU, WebGPU, Metal, Transform, Profiler, Language tests, etc.) use `T.float32` for accumulation by default. This is not a HIP-specific requirement but rather the standard approach for numerical precision; the reference computation in the test itself explicitly performs accumulation in float32 before converting to the output dtype. The change is correct and requires no further adjustment.

tilelang/intrinsics/mfma_layout.py (2)
3-3: LGTM: Import addition is correct. The addition of `tvm.tir.const` is appropriate for creating TIR constant expressions in the modified functions.
9-9: Use `const(0)` for TIR scalar constants in layout functions. The change from `convert(0)` to `const(0)` is correct. In TIR context, `const()` is the proper API for creating scalar constant expressions, while `convert()` is a runtime conversion utility. These layout functions are used in TIR operations where their return values are unpacked and applied directly to array indices, making `const(0)` semantically appropriate. The existing test suite (`test_tilelang_gemm_mfma_intrinsic.py`) exercises these functions through MFMA matrix operations and will catch any regressions.

src/tl_templates/hip/hip_fp8.h (1)
1-1: LGTM: Standard include guard added. Adding `#pragma once` is a best practice to prevent multiple inclusions of this header file.

src/op/gemm.cc (1)
157-159: LGTM: CDNA warp partitioning correctly configured. The CDNA target is properly handled by setting `kNPerWarp = 16`, consistent with Volta architecture. This aligns with AMD CDNA (gfx9xx) warp characteristics and ensures correct matrix partitioning for MFMA instructions.

testing/python/issue/test_tilelang_issue_96.py (1)
40-44: Verify device compatibility with "auto" target. The compilation target is set to `"auto"`, but lines 43-44 explicitly create tensors on the `"cuda"` device. If the "auto" target selects a non-CUDA backend or CUDA is unavailable, this test will fail at runtime. Consider adding the `@tilelang.testing.requires_cuda` decorator to the test functions (lines 52 and 57) to ensure CUDA availability, or update the tensor creation to use a device that matches the auto-selected target.

testing/python/language/test_tilelang_language_composable_index.py (1)
33-39: Verify device compatibility with "auto" target. The compilation target has been changed to `"auto"`, but line 39 still creates tensors with `device="cuda"`. This creates a potential mismatch if the "auto" target selects a non-CUDA backend. Consider either:

- Adding `@tilelang.testing.requires_cuda` to `test_tilelang_copy()` (line 44) to ensure CUDA is available, or
- Making the device selection dynamic based on the compiled target.
testing/python/language/test_tilelang_language_let.py (1)
6-6: LGTM: Appropriate CUDA requirement decorator. The `@tilelang.testing.requires_cuda` decorator correctly gates this test, which explicitly compiles for the "cuda" target (line 18) and verifies CUDA-specific vectorization codegen (line 19).

testing/python/language/test_tilelang_language_ptr.py (1)
44-52: Verify device compatibility with "auto" target. The compilation target has been changed to `"auto"`, but the test execution in `run_matmul()` creates all tensors with `device="cuda"` (lines 49-50, 52). This creates a potential runtime mismatch if the "auto" target selects a non-CUDA backend. Consider adding the `@tilelang.testing.requires_cuda` decorator to `test_matmul()` at line 61 to ensure CUDA availability before running this test.

testing/python/carver/test_tilelang_carver_recommend_hints.py (1)
136-138: Consider gating other template tests for consistency. Only `test_fmha_recommend_hints()` has the `@tilelang.testing.requires_cuda` decorator, while other template tests (`test_general_reduction_recommend_hints`, `test_elementwise_recommend_hints`, `test_matmul_recommend_hints`, `test_gemv_recommend_hints`) do not. If Flash Attention has specific CUDA requirements that other templates don't have, this is correct. Otherwise, consider whether the other tests should also be gated for consistency, especially since they all use `auto_infer_current_arch()` which may assume GPU availability.

testing/python/language/test_tilelang_language_alias.py (1)
48-50: No changes needed. The "auto" target in tilelang is specifically designed to automatically detect available devices (CUDA, HIP, or Metal) at runtime through the `determine_target()` function in `tilelang/utils/target.py`. The kernel compilation and execution will use whichever device is available on the system, making explicit `@requires_cuda` decorators unnecessary. This aligns with the codebase pattern where all language tests using `target="auto"` (test_tilelang_language_clear.py, test_tilelang_language_mask_op.py, test_tilelang_language_unroll.py, test_tilelang_language_copy.py, etc.) intentionally omit CUDA requirement decorators.

Likely an incorrect or invalid review comment.
testing/python/language/test_tilelang_language_annotate_safe_value.py (1)
45-47: LGTM - Appropriate CUDA gating. The `@requires_cuda` decorator correctly gates this test to CUDA-enabled environments, consistent with the device-specific operations in the test.

testing/python/carver/test_tilelang_carver_cuda_driver_properties.py (1)
29-33: LGTM - Appropriate CUDA requirement guards. All test functions now correctly require CUDA availability since they specifically test CUDA driver properties via `torch.cuda` and CUDA device attributes. The decorators ensure tests only run in appropriate environments.

Also applies to: 36-40, 43-47, 50-54, 57-61, 64-68, 71-75
testing/python/kernel/test_tilelang_kernel_gemm.py (1)
106-107: LGTM - Selective CUDA gating for specific GEMM configurations. The selective application of `@requires_cuda` decorators indicates intentional differentiation between CUDA-specific GEMM configurations and those that can run across multiple backends. This pattern appropriately gates tests that rely on CUDA-specific optimizations or behavior.

Also applies to: 172-173, 190-191, 216-217, 255-256, 273-274
testing/python/language/test_tilelang_language_unroll.py (1)
6-33: LGTM - Appropriate differentiation for backend-specific unroll support. The changes correctly handle backend-specific unroll pragma support:

- `test_unroll_with_step`: Uses `target="auto"` without decorator, allowing execution on both CUDA and HIP (basic `#pragma unroll` is widely supported)
- `test_unroll_with_unroll_factor`: Uses `target="auto"` with the `@requires_cuda` decorator, gating execution to CUDA-only since HIP doesn't support parameterized unroll factors

This pattern demonstrates the correct approach for handling backend-specific feature availability while still enabling automatic target selection where possible.
src/op/logical.cc (1)
40-54: LGTM - Correct HIP backend support for logical operators.The HIP intrinsic lowering registrations correctly extend multi-backend support for the
tl.any_ofandtl.all_ofoperators:
- Both operators now support
hip.FLowerIntrinsicalongside existingcuda.FLowerIntrinsic- The same lowering functions (
`any_of_op`, `all_of_op`) are appropriately shared between backends
- The lowering generates backend-agnostic external calls (
tl::Any,tl::All)This pattern aligns with the broader PR objective of introducing HIP/ROCm support.
testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py (1)
199-199: LGTM! Appropriate CUDA gating for tensor core tests.The
@tilelang.testing.requires_cudadecorators correctly gate tests that use CUDA-specific MMA tensor core operations.Also applies to: 403-403
testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py (1)
156-156: LGTM! Appropriate CUDA gating for sparse GEMM tests.The
@tilelang.testing.requires_cudadecorators correctly gate tests for CUDA-specific sparse tensor core operations.Also applies to: 308-308, 459-459, 614-614
tilelang/engine/lower.py (1)
15-15: LGTM! Refactoring to centralized path management.The changes correctly replace environment-variable based path resolution with centralized constants from
tilelang.env, improving maintainability and consistency across CUDA and HIP compilation paths.Also applies to: 76-77, 126-127
tilelang/jit/adapter/wrapper.py (2)
314-316: LGTM! Backend-specific declaration parsing.The new
get_declarationmethods appropriately handle the different declaration styles between CUDA (which has forward declarations ending with;) and HIP (which doesn't, so splits on{). This refactoring improves code organization and backend compatibility.Also applies to: 612-615
585-586: LGTM! ROCm FP8 type mapping additions.The type mappings for
float8_e5m2fnuzanduint64correctly support ROCm's FP8 types and standard unsigned 64-bit integers.Also applies to: 592-592
testing/python/jit/test_tilelang_jit_nvrtc.py (1)
100-100: LGTM! Appropriate CUDA gating for NVRTC tests.All tests using the NVRTC execution backend correctly require CUDA availability, as NVRTC is CUDA-specific.
Also applies to: 221-221, 273-273, 316-316, 366-366, 438-438, 452-452
src/tl_templates/hip/common.h (3)
113-130: LGTM! Generic helper utilities.The
AnyandAllhelper functions correctly implement short-circuit evaluation for boolean array checks.
151-198: LGTM! Proper type casting for half and bfloat16 shuffle operations.The specializations for
half_tandbfloat16_tcorrectly cast tofloatfor shuffle operations and cast back, which is necessary since hardware shuffle instructions typically operate on 32-bit values.
110-200: Address HIP shuffle function synchronization gap.The HIP implementation uses non-sync warp shuffle functions (
__shfl_xor,__shfl_down,__shfl_up,__shfl) without mask parameters, while the CUDA counterparts use explicit_syncvariants with mask support (e.g.,shfl_xor_sync(uint32_t(-1), ...)). The TODO comment acknowledges ROCm 7.1.1 providesshfl_syncsupport, but the implementation has not yet been updated.Verify that:
- HIP's non-sync shuffle intrinsics provide implicit or sufficient synchronization guarantees for warp-level operations in your use cases
- The current approach in
src/tl_templates/hip/reduce.h(which callstl::shfl_*without masks) is safe and equivalent to the CUDA masked approach, or update to use_syncvariants once ROCm support is confirmedtesting/python/tilelibrary/test_tilelang_tilelibrary_gemm.py (1)
114-119: Verify accumulator dtype change from float16 to float32.Multiple test cases have been updated to use
T.float32fordtypeAccuminstead ofT.float16. While this change may improve numerical accuracy, it could mask precision issues that would occur with lower precision accumulation.Ensure this change aligns with your testing strategy and doesn't inadvertently reduce test coverage for float16 accumulation paths.
Also applies to: 272-276, 429-433, 593-596
src/tl_templates/hip/atomic.h (2)
7-20: LGTM! Appropriate overloads for pointer and reference operands.The template overloads correctly handle both pointer (
T1*) and reference (T1&) first arguments, providing flexibility for different addressing patterns during code generation.
64-104: LGTM! Vectorized atomic operations for multi-component types.The vectorized variants (
AtomicAddx2,AtomicAddx4) correctly decomposefloat2andfloat4operations into individual component-wise atomics, with appropriate return value variants.src/tl_templates/hip/debug.h (1)
215-230: LGTM: Trait-based design is a good improvement.The
PrintTraitsprimary template with fallback behavior and macro-generated specializations provides clean extensibility and reduces code duplication.testing/python/debug/test_device_assert.py (1)
16-16: LGTM: Target changed to "auto" for automatic backend selection.The change from
target="cuda"totarget="auto"enables this test to run on both CUDA and ROCm backends, aligning with the PR's multi-backend support goal.Also applies to: 28-28
testing/python/language/test_tilelang_language_clear.py (1)
44-51: Inconsistency:target="auto"but tensors use.cuda()explicitly.The compilation target is set to
"auto"for backend-agnostic selection, but tensor creation at lines 48-49 uses.cuda()which is CUDA-specific. On ROCm, this may still work (PyTorch ROCm maps.cuda()to HIP), but for clarity and consistency, consider using a device-agnostic approach.Verify that
`.cuda()` works correctly on ROCm environments, or consider using a utility function that abstracts the device selection (a hypothetical sketch follows).
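A hypothetical helper (not part of tilelang) that abstracts the device choice; it leans on the fact that PyTorch's ROCm builds map the "cuda" device string onto HIP, so a single string covers both backends:

```python
import pytest
import torch

def auto_test_device() -> torch.device:
    # True for both CUDA and ROCm builds of PyTorch
    if torch.cuda.is_available():
        return torch.device("cuda")
    pytest.skip("no CUDA/ROCm device available")

# usage sketch: A = torch.randn(M, N, dtype=torch.float16, device=auto_test_device())
```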
testing/python/language/test_tilelang_language_alloc.py (2)

36-36: LGTM: Assertion relaxation accommodates code generation variations. The updated assertions accept both
"tmp ="and"tmp[0] ="patterns, accommodating differences in how variables are generated across backends.Also applies to: 76-76
118-121: LGTM: Appropriate CUDA gating for unsupported ROCm features.The
@tilelang.testing.requires_cudadecorator correctly gates tests that depend onalloc_varwith initializer, which is noted as unsupported on ROCm. The TODO comments provide useful context.Also applies to: 159-162
testing/python/autotune/test_tilelang_autotune.py (1)
263-266: LGTM: CUDA requirement decorators appropriately gate autotuner tests.The autotuner tests rely on CUDA-specific carver infrastructure (e.g.,
CUDA("cuda")at line 52). The@tilelang.testing.requires_cudadecorators correctly prevent these tests from running on non-CUDA environments.Also applies to: 269-272
testing/python/language/test_tilelang_language_copy.py (1)
32-39: LGTM: Target changed to "auto" for multi-backend support.The compilation targets are updated to
"auto"consistently across the copy test functions. Note that tensor creation still usesdevice="cuda"- this works on ROCm via PyTorch's HIP mapping, but consider abstracting device selection for better clarity if expanding to more backends.Also applies to: 69-79, 134-139
testing/python/language/test_tilelang_language_frontend_v2.py (1)
339-341: LGTM: Consistent CUDA gating foralloc_varwith initializer.The
test_while_loopfunction usesT.alloc_var(T.int32, 0)with an initializer value, which is noted as unsupported on ROCm. The decorator appropriately gates this test.src/target/codegen_hip.cc (2)
945-947: LGTM: FP8 E5M2 dtype mappings added.The new mappings for
float8_e5m2fnuzx4andfloat8_e5m2fnuzx8follow the existing pattern established forfloat8_e4m3fnuzvariants. Using"long"for the 8-element vector is consistent with the e4m3 variant.
985-995: Warp reduction implementations are correctly in place.All five warp reduction functions (
warp_reduce_sum,warp_reduce_max,warp_reduce_min,warp_reduce_bitand,warp_reduce_bitor) are implemented insrc/tl_templates/hip/reduce.hat lines 273, 277, 281, 285, and 289 respectively. The handlers incodegen_hip.cccorrectly delegate to thesetl::namespace functions.src/tl_templates/cuda/reduce.h (1)
177-186: LGTM!The changes correctly parameterize thread indexing via
SEG(defaulting to 32 for CUDA's warp size) and align the return type with the HIP counterpart. The logic forlane = tid % SEGandrow = tid / SEGis consistent withCumSum1Din the same file.testing/python/debug/test_tilelang_debug_print.py (1)
20-34: Good refactoring with pytest parameterization.Clean separation of platform-agnostic dtype tests via
`@pytest.mark.parametrize` and platform-specific FP8 tests using `@tilelang.testing.requires_cuda` / `@tilelang.testing.requires_rocm` decorators. (An illustrative sketch of this gating pattern follows.)
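An illustrative sketch of the gating pattern described above; the test names and bodies are placeholders, only the decorators mirror the refactor:

```python
import pytest
import tilelang.testing

@pytest.mark.parametrize("dtype", ["float16", "float32", "int32"])
def test_debug_print_value(dtype):
    ...  # platform-agnostic dtypes go through one parametrized helper call

@tilelang.testing.requires_cuda
def test_debug_print_value_fp8_cuda():
    ...  # CUDA-only FP8 (e4m3/e5m2) variant

@tilelang.testing.requires_rocm
def test_debug_print_value_fp8_rocm():
    ...  # ROCm-only FP8 (fnuz) variant
```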
src/tl_templates/hip/reduce.h (3)

117-182: LGTM! The
`CumSum1D` implementation correctly uses the `tl::shfl_*` wrappers for type-safe shuffle operations and appropriately defaults `SEG=64` for HIP's 64-wide wavefronts. The static_assert correctly excludes `threads==32`, which would be insufficient for the HIP architecture.
184-260: LGTM!The
`CumSum2D` implementation is well-structured with correct use of the `tl::shfl_*` wrappers and an appropriate `SEG=64` default for HIP. The axis and reverse handling logic is consistent with the CUDA version in `src/tl_templates/cuda/reduce.h`.
75-78: Good refactoring to usetl::wrappers.The updates to use
`tl::shfl_down` and `tl::shfl_xor` in `SharedReduceWarp` and `AllReduce` ensure proper handling of `half_t` and `bfloat16_t` types through the specializations in `common.h`.

Also applies to: 106-108
```yaml
        ./python/jit/test_tilelang_jit_cutedsl.py

      # AMD ROCm tests
      # runtime and transform tests needs to repair, then rm it from ignore list
```
Fix grammar and add tracking reference.
The comment has a grammatical error: "needs to repair" should be "need to be repaired" or "need repair". Additionally, consider adding a TODO marker or issue reference to track when these directories can be un-ignored.
🔎 Suggested fix
```diff
- # runtime and transform tests needs to repair, then rm it from ignore list
+ # TODO(#XXXX): runtime and transform tests need to be repaired, then remove from ignore list
```

Replace #XXXX with an actual issue number. Would you like me to help create a tracking issue for this technical debt?
🤖 Prompt for AI Agents
In .github/workflows/ci.yml around line 395, the inline comment "runtime and
transform tests needs to repair, then rm it from ignore list" is grammatically
incorrect and lacks tracking: change it to a clear TODO such as "# TODO: runtime
and transform tests need to be repaired; remove from ignore list — see
ISSUE-XXXX" (or "# TODO: runtime and transform tests need repair; remove from
ignore list — see #1234"), and replace ISSUE-XXXX/#1234 with an actual issue
number or create one and reference it.
```cpp
DEFINE_PRINT_TRAIT(char, "char", "%d", int);
DEFINE_PRINT_TRAIT(signed char, "signed char", "%d", int);
DEFINE_PRINT_TRAIT(unsigned char, "unsigned char", "%u", unsigned int);
DEFINE_PRINT_TRAIT(short, "short", "%d", int);
DEFINE_PRINT_TRAIT(unsigned short, "unsigned short", "%u", unsigned int);
DEFINE_PRINT_TRAIT(int, "int", "%d", int);
DEFINE_PRINT_TRAIT(unsigned int, "uint", "%u", unsigned int);
DEFINE_PRINT_TRAIT(long, "long", "%ld", long);
DEFINE_PRINT_TRAIT(unsigned long, "ulong", "%lu", unsigned long);
DEFINE_PRINT_TRAIT(long long, "long long", "%lld", long long);
```
Missing unsigned long long specialization.
The trait defines long long but not unsigned long long. This could cause the fallback template to be used for uint64_t values, printing them as pointers instead of their actual values.
🔎 Proposed fix

```diff
 DEFINE_PRINT_TRAIT(long long, "long long", "%lld", long long);
+DEFINE_PRINT_TRAIT(unsigned long long, "unsigned long long", "%llu", unsigned long long);
```
🤖 Prompt for AI Agents
In src/tl_templates/hip/debug.h around lines 251 to 260, the print-trait list
defines long long but omits unsigned long long, which can cause uint64_t to fall
back to the pointer-printing trait; add a DEFINE_PRINT_TRAIT entry for unsigned
long long using the name "unsigned long long", the "%llu" format specifier, and
unsigned long long as the cast/print type (mirror the pattern used for other
unsigned types).
```cpp
template <typename T, typename ReduceOp>
TL_DEVICE T warp_reduce(T value, ReduceOp op) {
  value = op(value, __shfl_xor(value, 32));
  value = op(value, __shfl_xor(value, 16));
  value = op(value, __shfl_xor(value, 8));
  value = op(value, __shfl_xor(value, 4));
  value = op(value, __shfl_xor(value, 2));
  value = op(value, __shfl_xor(value, 1));
  return value;
}
```
Use tl::shfl_xor instead of raw __shfl_xor for type safety.
The warp_reduce function uses raw HIP intrinsics (__shfl_xor) which will produce incorrect results or fail for half_t and bfloat16_t types. These types require the tl::shfl_xor wrapper (defined in common.h lines 148-177) that converts to float before shuffling.
This is inconsistent with the rest of the file where tl::shfl_* wrappers are correctly used (e.g., lines 76, 107, 142, 165, etc.).
🔎 Proposed fix

```diff
 template <typename T, typename ReduceOp>
 TL_DEVICE T warp_reduce(T value, ReduceOp op) {
-  value = op(value, __shfl_xor(value, 32));
-  value = op(value, __shfl_xor(value, 16));
-  value = op(value, __shfl_xor(value, 8));
-  value = op(value, __shfl_xor(value, 4));
-  value = op(value, __shfl_xor(value, 2));
-  value = op(value, __shfl_xor(value, 1));
+  value = op(value, tl::shfl_xor(value, 32));
+  value = op(value, tl::shfl_xor(value, 16));
+  value = op(value, tl::shfl_xor(value, 8));
+  value = op(value, tl::shfl_xor(value, 4));
+  value = op(value, tl::shfl_xor(value, 2));
+  value = op(value, tl::shfl_xor(value, 1));
   return value;
 }
```
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| template <typename T, typename ReduceOp> | |
| TL_DEVICE T warp_reduce(T value, ReduceOp op) { | |
| value = op(value, __shfl_xor(value, 32)); | |
| value = op(value, __shfl_xor(value, 16)); | |
| value = op(value, __shfl_xor(value, 8)); | |
| value = op(value, __shfl_xor(value, 4)); | |
| value = op(value, __shfl_xor(value, 2)); | |
| value = op(value, __shfl_xor(value, 1)); | |
| return value; | |
| } | |
| template <typename T, typename ReduceOp> | |
| TL_DEVICE T warp_reduce(T value, ReduceOp op) { | |
| value = op(value, tl::shfl_xor(value, 32)); | |
| value = op(value, tl::shfl_xor(value, 16)); | |
| value = op(value, tl::shfl_xor(value, 8)); | |
| value = op(value, tl::shfl_xor(value, 4)); | |
| value = op(value, tl::shfl_xor(value, 2)); | |
| value = op(value, tl::shfl_xor(value, 1)); | |
| return value; | |
| } |
🤖 Prompt for AI Agents
In src/tl_templates/hip/reduce.h around lines 262 to 271, the warp_reduce
implementation uses raw HIP intrinsic __shfl_xor which breaks for
half_t/bfloat16_t; replace each __shfl_xor call with the type-safe wrapper
tl::shfl_xor (e.g., tl::shfl_xor(value, 32), etc.) so the wrapper in common.h
handles conversion to float for those types; keep the same lane masks and op
logic unchanged.
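To make the xor-shuffle butterfly that `warp_reduce` performs easier to follow, here is a NumPy simulation of the reduction across a 64-lane HIP wavefront (illustrative only; it models the lane exchange, not the actual HIP intrinsics):

```python
import numpy as np

def simulate_warp_reduce(values, op=np.add, width=64):
    lanes = np.asarray(values, dtype=np.float64).copy()
    assert lanes.size == width
    offset = width // 2
    while offset >= 1:                              # 32, 16, 8, 4, 2, 1 for width=64
        partner = lanes[np.arange(width) ^ offset]  # analogue of __shfl_xor(value, offset)
        lanes = op(lanes, partner)
        offset //= 2
    return lanes                                    # every lane ends up with the full reduction

vals = np.random.rand(64)
assert np.allclose(simulate_warp_reduce(vals), vals.sum())
```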
```python
@tilelang.testing.requires_cuda
def test_atomic_addx2_half():
    run_atomic_addx2(32, 64, 8, 16, dtype=T.float16)


def test_atomic_addx2_float():
    run_atomic_addx2(32, 64, 8, 16, dtype=T.float32)
```
Add missing @requires_cuda decorator for consistency.
test_atomic_addx2_half (Line 248) is decorated with @tilelang.testing.requires_cuda, but test_atomic_addx2_float (Line 253) is not. Both tests use the same underlying atomic_addx2 operation, which appears to be CUDA-specific. This inconsistency may cause test_atomic_addx2_float to fail on non-CUDA platforms.
🔎 Proposed fix

```diff
+@tilelang.testing.requires_cuda
 def test_atomic_addx2_float():
     run_atomic_addx2(32, 64, 8, 16, dtype=T.float32)
```

🤖 Prompt for AI Agents
run_atomic_addx2(32, 64, 8, 16, dtype=T.float32)🤖 Prompt for AI Agents
In testing/python/language/test_tilelang_language_atomic_add.py around lines 248
to 254, the test function test_atomic_addx2_float is missing the
@tilelang.testing.requires_cuda decorator (test_atomic_addx2_half has it),
causing inconsistency and possible failures on non-CUDA hosts; add the
@tilelang.testing.requires_cuda decorator immediately above the def
test_atomic_addx2_float() line so both CUDA-specific tests are consistently
skipped when CUDA is unavailable.
```python
@pytest.mark.skip(reason="Temporarily disabling until GEMM SS issues are resolved")
@tilelang.testing.requires_cuda
@pytest.mark.parametrize(
    "M, N, K, trans_A, trans_B, in_dtype, out_dtype, dtypeAccum, block_M, block_N, block_K, num_stages, num_threads",
    [
        (128, 128, 128, True, True, T.float8_e5m2, T.float8_e5m2, T.float32, 128, 128, 32, 2, 128),
        (128, 128, 128, True, True, T.float8_e4m3, T.float8_e4m3, T.float32, 128, 128, 32, 2, 128),
    ],
)
def test_gemm_ss_fp8_cuda(M, N, K, trans_A, trans_B, in_dtype, out_dtype, dtypeAccum, block_M, block_N, block_K, num_stages, num_threads):
    run_gemm_ss(M, N, K, trans_A, trans_B, in_dtype, out_dtype, dtypeAccum, block_M, block_N, block_K, num_stages, num_threads)


@pytest.mark.skip(reason="Temporarily disabling until GEMM SS issues are resolved")
@tilelang.testing.requires_rocm
@pytest.mark.parametrize(
    "M, N, K, trans_A, trans_B, in_dtype, out_dtype, dtypeAccum, block_M, block_N, block_K, num_stages, num_threads",
    [
        (128, 128, 128, True, True, T.float8_e5m2fnuz, T.float8_e5m2fnuz, T.float32, 128, 128, 32, 2, 128),
        (128, 128, 128, True, True, T.float8_e4m3fnuz, T.float8_e4m3fnuz, T.float32, 128, 128, 32, 2, 128),
    ],
)
def test_gemm_ss_fp8_rocm(M, N, K, trans_A, trans_B, in_dtype, out_dtype, dtypeAccum, block_M, block_N, block_K, num_stages, num_threads):
    run_gemm_ss(M, N, K, trans_A, trans_B, in_dtype, out_dtype, dtypeAccum, block_M, block_N, block_K, num_stages, num_threads)
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion | 🟠 Major
Document FP8 test status and unskip plan.
The new FP8 GEMM tests for both CUDA and ROCm are marked with @pytest.mark.skip. The skip reason indicates temporary disabling until GEMM SS/RS/SR/RR issues are resolved. Several ROCm FP8 tests also have TODO comments about precision problems.
Consider:
- Creating tracking issues for the known problems referenced in the skip reasons
- Adding issue references in the skip decorators for traceability
- Documenting the expected timeline for enabling these tests
Example with issue tracking:
@pytest.mark.skip(reason="Temporarily disabling until GEMM SS issues are resolved. See issue #XXXX")Also applies to: 291-315, 450-473, 615-638
🤖 Prompt for AI Agents
testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py lines 133-157: the
FP8 GEMM tests are skipped with a generic reason; create tracking issues for the
known GEMM SS/RS/SR/RR and ROCm precision problems, then update each
@pytest.mark.skip reason to include the corresponding issue number (e.g.,
"Temporarily disabling until GEMM SS issues are resolved. See issue #<id>") for
traceability, and add a short comment above each skipped block documenting the
expected timeline and next steps to re-enable the tests.
0e5da23 to
a9f8c40
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
testing/python/carver/test_tilelang_carver_cuda_driver_properties.py (2)
12-12: Move torch import inside test functions or handle ImportError.The module-level
torchimport will fail before the@tilelang.testing.requires_cudadecorator can skip tests when torch is not installed. Since the decorator only gates test execution (not module import), an ImportError here prevents the test module from being collected at all.Consider either:
- Moving the torch import inside test functions that need it
- Wrapping the import in a try-except block at module level
- Using pytest's
importorskipif available🔎 Example fix: Conditional import
-import torch +try: + import torch +except ImportError: + torch = NoneThen each test can rely on the
@requires_cudadecorator to skip when torch/CUDA is unavailable.
71-76: The functionget_max_dynamic_shared_size_bytes()appears to have a semantic mismatch between its name/documentation and its implementation.The function name and docstring suggest it returns the maximum dynamic shared memory that can be requested per block. However, the implementation uses
cudaDevAttrMaxSharedMemoryPerMultiprocessor(attribute 81), which returns the total shared memory capacity of a multiprocessor—not the per-block dynamic allocation limit.These are different CUDA resource limits:
- Per-block dynamic memory: Maximum shared memory a single block can request (typically 48KB–96KB depending on compute capability)
- Per-multiprocessor total: Total shared memory available on an SM, shared among all blocks (96KB–192KB+)
Consider whether the function should instead use
cudaDevAttrMaxSharedMemoryPerBlock(attribute 8) to match its semantic intent.testing/python/language/test_tilelang_language_infinity.py (1)
24-24: Remove@requires_cudaor add a TODO explaining the limitation.This test uses the basic
T.infinity()operation with no documented ROCm limitation, unlike other CUDA-restricted tests (unroll, var_init, alloc_var) which include TODO comments. Similar platform-agnostic operations likeclamp,clear, and others lack this decorator. Either remove it to allow ROCm testing, or add a TODO comment explaining the platform-specific constraint.testing/python/language/test_tilelang_language_atomic_add.py (1)
226-236: Add@tilelang.testing.requires_cudadecorators to undecorated atomic tests that use CUDA.This PR adds the
@tilelang.testing.requires_cudadecorator to several tests but misses others that also call.cuda(). The following tests use CUDA and require the decorator:
test_atomic_add(line 226)test_atomic_max(line 230)test_atomic_min(line 234)test_atomic_addx4(line 360)test_atomic_return_prev(line 364)test_tile_atomic_add(line 368)All underlying
run_*functions call.cuda()directly, so tests will fail on non-CUDA systems without decoration.testing/python/language/test_tilelang_language_composable_index.py (1)
39-39: Critical: Hardcoded CUDA device conflicts with auto target selection.The PyTorch tensor is created with
device="cuda", but the kernel is compiled withtarget="auto"(line 33). This inconsistency will cause the test to fail on ROCm systems where no CUDA device exists.🔎 Proposed fix to use dynamic device selection
- a = torch.randn(M, N, device="cuda", dtype=getattr(torch, dtype)) + device = "cuda" if torch.cuda.is_available() else "hip" + a = torch.randn(M, N, device=device, dtype=getattr(torch, dtype))Alternatively, if there's a project utility for device detection, use that instead.
♻️ Duplicate comments (7)
testing/python/language/test_tilelang_language_annot.py (2)
31-32: Same TODO clarity issue as above.The same misleading wording ("disable for now") appears here. Apply the same fix as suggested for lines 7-8.
55-56: Same TODO clarity issue as above.The same misleading wording ("disable for now") appears here. Apply the same fix as suggested for lines 7-8.
src/tl_templates/hip/reduce.h (1)
262-271: Usetl::shfl_xorinstead of raw__shfl_xorfor type safety.The
warp_reducefunction uses raw HIP intrinsics (__shfl_xor) which will produce incorrect results forhalf_tandbfloat16_ttypes. These types require thetl::shfl_xorwrapper (defined incommon.h) that converts to float before shuffling.This is inconsistent with the rest of the file where
tl::shfl_*wrappers are correctly used (lines 76, 107, 142, 152, etc.).🔎 Proposed fix
template <typename T, typename ReduceOp> TL_DEVICE T warp_reduce(T value, ReduceOp op) { - value = op(value, __shfl_xor(value, 32)); - value = op(value, __shfl_xor(value, 16)); - value = op(value, __shfl_xor(value, 8)); - value = op(value, __shfl_xor(value, 4)); - value = op(value, __shfl_xor(value, 2)); - value = op(value, __shfl_xor(value, 1)); + value = op(value, tl::shfl_xor(value, 32)); + value = op(value, tl::shfl_xor(value, 16)); + value = op(value, tl::shfl_xor(value, 8)); + value = op(value, tl::shfl_xor(value, 4)); + value = op(value, tl::shfl_xor(value, 2)); + value = op(value, tl::shfl_xor(value, 1)); return value; }src/tl_templates/hip/debug.h (1)
40-49: Missingunsigned long longspecialization.The trait defines
long longbut notunsigned long long. This could cause the fallback template to be used foruint64_tvalues, printing them as pointers instead of their actual values.🔎 Proposed fix
DEFINE_PRINT_TRAIT(long long, "long long", "%lld", long long); +DEFINE_PRINT_TRAIT(unsigned long long, "unsigned long long", "%llu", unsigned long long);testing/python/language/test_tilelang_language_alloc.py (1)
159-160: Same tracking issue suggestion applies here.As with line 118-119, consider adding a tracking issue reference to ensure ROCm support for multiple variable initialization is tracked and implemented.
testing/python/issue/test_tilelang_issue_1008.py (1)
49-53: LGTM! Same consideration as above.The
@tilelang.testing.requires_cudadecorator correctly gates this test for CUDA availability, matching the hardcodeddevice="cuda"on line 52. This prevents failures in ROCm-only CI environments.The same consideration applies: since the kernel logic is device-agnostic and the PR adds ROCm support, consider whether ROCm test coverage should be added for this issue.
testing/python/language/test_tilelang_language_atomic_add.py (1)
253-254: Add missing@requires_cudadecorator (previously flagged).
test_atomic_addx2_floatis missing the@tilelang.testing.requires_cudadecorator, whiletest_atomic_addx2_half(line 248) has it. Both tests invokerun_atomic_addx2, which calls.cuda()and will fail on non-CUDA platforms. This issue was raised in a previous review but remains unaddressed.🔎 Required fix
+@tilelang.testing.requires_cuda def test_atomic_addx2_float(): run_atomic_addx2(32, 64, 8, 16, dtype=T.float32)
🧹 Nitpick comments (8)
testing/python/language/test_tilelang_language_var_init.py (1)
6-7: CUDA gating aligns with the HIP limitation.The decorator correctly restricts this test to CUDA since var init is not yet supported on HIP/ROCm.
Optionally, consider making the TODO more actionable by:
- Referencing a tracking issue number (if one exists)
- Briefly noting what blocks HIP support (e.g., missing codegen, runtime limitation)
This would help future contributors understand the scope of work needed.
testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py (2)
133-156: Inconsistent skip decorator usage across FP8 tests.The FP8 tests for SS and RS variants are marked with
@pytest.mark.skip, but the SR and RR FP8 tests (lines 448-470, 612-634) are not skipped. This creates inconsistency in which FP8 tests run by default.Additionally, as noted in past review comments, consider adding issue tracking references to the skip reasons for better traceability.
Verify whether the SR and RR FP8 tests should also be skipped for consistency, or if the SS and RS tests can be unskipped.
Also applies to: 290-313
432-441: TODO comments lack issue tracking references.Multiple TODO comments reference precision problems on ROCm but don't include tracking issue numbers. Consider creating issues for these known problems and referencing them in the comments for better traceability and project management.
Also applies to: 463-465, 595-597, 627-629
testing/python/debug/test_tilelang_debug_print.py (1)
50-50: Approved: Explicit target="auto" for cross-platform support.The explicit
target="auto"specification aligns with the PR objective to enable ROCm CI testing and ensures these tests run on the appropriate available hardware.Note: Line 15 in
debug_print_bufferdoesn't explicitly specifytarget="auto". While it defaults to "auto" per the documentation, consider adding it for consistency with the rest of the file.🔎 Optional consistency fix for line 15
- jit_kernel = tilelang.compile(program) + jit_kernel = tilelang.compile(program, target="auto")Also applies to: 69-69, 88-88, 107-107
testing/python/language/test_tilelang_language_alloc.py (1)
118-119: Consider adding a tracking issue reference to the TODO.The CUDA-only gating is reasonable given incomplete ROCm support for variable initialization. However, the TODO comment would benefit from a reference to a tracking issue to ensure ROCm support isn't forgotten.
Example format:
# TODO(Gong): ROCm is not supported yet, disable for now (tracked in #XXXX)testing/python/carver/test_tilelang_carver_cuda_driver_properties.py (1)
29-33: Consider avoiding private torch API.Line 33 references
torch.cuda._CudaDeviceProperties, which is a private class (indicated by the underscore prefix). Private APIs can change without notice in future torch versions.If type validation is essential, consider checking for the presence of expected attributes instead, or use duck typing:
assert hasattr(prop, 'name') and hasattr(prop, 'multi_processor_count'), \ "Returned object does not have expected CUDA device properties"testing/python/language/test_tilelang_language_atomic_add.py (2)
198-198: Consider using consistent dtype defaults across all atomic operation functions.The
atomic_addx2_programandrun_atomic_addx2functions now default todtype=T.float16, while all other atomic operation functions in this file default todtype=T.float32(e.g.,atomic_add_program,atomic_max_program,atomic_min_program, etc.). This inconsistency may confuse users expecting uniform behavior across the API.🔎 Proposed fix to align with float32 default
-def atomic_addx2_program(M, N, block_M, block_N, dtype=T.float16): +def atomic_addx2_program(M, N, block_M, block_N, dtype=T.float32):-def run_atomic_addx2(M, N, block_M, block_N, dtype=T.float16): +def run_atomic_addx2(M, N, block_M, block_N, dtype=T.float32):If float16 is the intended default for this specific operation, consider documenting the rationale.
Also applies to: 210-210
214-215: Simplify tensor creation by using the target dtype directly.These lines create tensors in
float32and then convert to the target dtype. Other functions in this file (e.g.,run_atomic_addat lines 30-31,run_atomic_maxat lines 98-99) create tensors directly with the target dtype usingdtype=getattr(torch, dtype), which is more efficient and consistent.🔎 Proposed simplification
- A = torch.randn(M, N, dtype=torch.float32).cuda().to(getattr(torch, dtype)) - B = torch.zeros(M, N, dtype=torch.float32).cuda().to(getattr(torch, dtype)) + A = torch.randn(M, N, dtype=getattr(torch, dtype)).cuda() + B = torch.zeros(M, N, dtype=getattr(torch, dtype)).cuda()
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (52)
.github/workflows/ci.ymlsrc/op/gemm.ccsrc/op/logical.ccsrc/target/codegen_hip.ccsrc/tl_templates/cuda/reduce.hsrc/tl_templates/hip/atomic.hsrc/tl_templates/hip/common.hsrc/tl_templates/hip/debug.hsrc/tl_templates/hip/hip_fp8.hsrc/tl_templates/hip/reduce.htesting/python/autotune/test_tilelang_autotune.pytesting/python/carver/test_tilelang_carver_cuda_driver_properties.pytesting/python/carver/test_tilelang_carver_recommend_hints.pytesting/python/components/test_storage_rewrite_detect_inplace.pytesting/python/debug/test_device_assert.pytesting/python/debug/test_tilelang_debug_print.pytesting/python/issue/test_tilelang_issue_1001.pytesting/python/issue/test_tilelang_issue_1008.pytesting/python/issue/test_tilelang_issue_830.pytesting/python/issue/test_tilelang_issue_96.pytesting/python/jit/test_tilelang_jit_callback.pytesting/python/jit/test_tilelang_jit_cutedsl.pytesting/python/jit/test_tilelang_jit_gemm.pytesting/python/jit/test_tilelang_jit_gemm_cython.pytesting/python/jit/test_tilelang_jit_nvrtc.pytesting/python/jit/test_tilelang_jit_parcompile.pytesting/python/jit/test_tilelang_jit_tvm_ffi.pytesting/python/kernel/test_tilelang_kernel_gemm.pytesting/python/kernel/test_tilelang_kernel_gemm_simt.pytesting/python/kernel/test_tilelang_kernel_int4_gemm_mma.pytesting/python/language/test_tilelang_language_alias.pytesting/python/language/test_tilelang_language_alloc.pytesting/python/language/test_tilelang_language_annot.pytesting/python/language/test_tilelang_language_annotate_safe_value.pytesting/python/language/test_tilelang_language_atomic_add.pytesting/python/language/test_tilelang_language_clear.pytesting/python/language/test_tilelang_language_composable_index.pytesting/python/language/test_tilelang_language_copy.pytesting/python/language/test_tilelang_language_frontend_v2.pytesting/python/language/test_tilelang_language_infinity.pytesting/python/language/test_tilelang_language_let.pytesting/python/language/test_tilelang_language_mask_op.pytesting/python/language/test_tilelang_language_ptr.pytesting/python/language/test_tilelang_language_unroll.pytesting/python/language/test_tilelang_language_var_init.pytesting/python/language/test_tilelang_language_vectorized_cast.pytesting/python/tilelibrary/test_tilelang_tilelibrary_gemm.pytesting/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.pytilelang/engine/lower.pytilelang/intrinsics/mfma_layout.pytilelang/intrinsics/mfma_macro_generator.pytilelang/jit/adapter/wrapper.py
💤 Files with no reviewable changes (1)
- testing/python/issue/test_tilelang_issue_830.py
✅ Files skipped from review due to trivial changes (2)
- src/tl_templates/hip/hip_fp8.h
- src/tl_templates/hip/atomic.h
🚧 Files skipped from review as they are similar to previous changes (15)
- .github/workflows/ci.yml
- tilelang/intrinsics/mfma_macro_generator.py
- src/op/gemm.cc
- testing/python/issue/test_tilelang_issue_1001.py
- testing/python/language/test_tilelang_language_ptr.py
- testing/python/language/test_tilelang_language_frontend_v2.py
- testing/python/jit/test_tilelang_jit_parcompile.py
- testing/python/jit/test_tilelang_jit_tvm_ffi.py
- testing/python/language/test_tilelang_language_let.py
- testing/python/kernel/test_tilelang_kernel_gemm.py
- testing/python/language/test_tilelang_language_vectorized_cast.py
- testing/python/language/test_tilelang_language_alias.py
- tilelang/intrinsics/mfma_layout.py
- testing/python/issue/test_tilelang_issue_96.py
- tilelang/jit/adapter/wrapper.py
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.
Applied to files:
testing/python/language/test_tilelang_language_clear.pytesting/python/language/test_tilelang_language_copy.pytesting/python/language/test_tilelang_language_annot.pytesting/python/language/test_tilelang_language_annotate_safe_value.pytesting/python/kernel/test_tilelang_kernel_gemm_simt.pytesting/python/language/test_tilelang_language_mask_op.pytesting/python/language/test_tilelang_language_unroll.pytesting/python/jit/test_tilelang_jit_cutedsl.pytesting/python/language/test_tilelang_language_alloc.pytesting/python/jit/test_tilelang_jit_callback.pytesting/python/autotune/test_tilelang_autotune.pytesting/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.pytesting/python/kernel/test_tilelang_kernel_int4_gemm_mma.pytesting/python/jit/test_tilelang_jit_gemm.pytesting/python/jit/test_tilelang_jit_nvrtc.pytesting/python/issue/test_tilelang_issue_1008.pytesting/python/language/test_tilelang_language_var_init.pytilelang/engine/lower.pytesting/python/carver/test_tilelang_carver_cuda_driver_properties.py
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.
Applied to files:
testing/python/language/test_tilelang_language_clear.pytesting/python/components/test_storage_rewrite_detect_inplace.pytesting/python/language/test_tilelang_language_annotate_safe_value.pytesting/python/kernel/test_tilelang_kernel_gemm_simt.pytesting/python/language/test_tilelang_language_mask_op.pytesting/python/language/test_tilelang_language_unroll.pytesting/python/language/test_tilelang_language_alloc.pytesting/python/language/test_tilelang_language_infinity.pytesting/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.pytesting/python/kernel/test_tilelang_kernel_int4_gemm_mma.pytesting/python/issue/test_tilelang_issue_1008.pytesting/python/tilelibrary/test_tilelang_tilelibrary_gemm.py
📚 Learning: 2025-12-26T06:45:51.789Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1483
File: tilelang/jit/adapter/cutedsl/adapter.py:93-95
Timestamp: 2025-12-26T06:45:51.789Z
Learning: For the CuTeDSL backend in tilelang/jit/adapter/cutedsl/adapter.py, the host_kernel_source and device_kernel_source have the same value.
Applied to files:
testing/python/language/test_tilelang_language_copy.pytesting/python/jit/test_tilelang_jit_cutedsl.py
📚 Learning: 2025-09-15T10:51:06.985Z
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h.
Applied to files:
src/tl_templates/hip/reduce.h
📚 Learning: 2025-09-15T10:51:06.985Z
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). However, the layout template parameter should use sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h where Crosswise uses sizeof(ElementA) but the layout template uses sizeof_bits<ElementA>::value.
Applied to files:
src/tl_templates/hip/reduce.h
📚 Learning: 2025-12-24T17:20:32.819Z
Learnt from: clouds56
Repo: tile-ai/tilelang PR: 1527
File: tilelang/env.py:0-0
Timestamp: 2025-12-24T17:20:32.819Z
Learning: The nvidia-cuda-nvcc PyPI package installs to `nvidia/cu13/bin/` (for CUDA 13), `nvidia/cu12/bin/` (for CUDA 12), and `nvidia/cu11/bin/` (for CUDA 11) in the site-packages directory, not to `nvidia/cuda_nvcc/bin/`. These paths should be used when detecting CUDA installations from PyPI packages in tilelang/env.py.
Applied to files:
testing/python/issue/test_tilelang_issue_1008.pytilelang/engine/lower.py
🧬 Code graph analysis (14)
testing/python/jit/test_tilelang_jit_gemm_cython.py (2)
tilelang/language/v2/dtypes.py (2)
float32(300-300)float16(299-299)tilelang/language/symbolics.py (1)
dynamic(12-29)
testing/python/debug/test_device_assert.py (1)
tilelang/language/ast/ir.py (1)
target(1677-1707)
testing/python/debug/test_tilelang_debug_print.py (2)
tilelang/jit/__init__.py (2)
compile(47-107)compile(347-373)tilelang/language/ast/ir.py (1)
target(1677-1707)
testing/python/language/test_tilelang_language_copy.py (1)
tilelang/language/ast/ir.py (1)
target(1677-1707)
src/tl_templates/cuda/reduce.h (1)
src/tl_templates/hip/reduce.h (2)
run(121-181)run(188-259)
testing/python/kernel/test_tilelang_kernel_gemm_simt.py (1)
testing/python/kernel/test_tilelang_kernel_gemm.py (1)
matmul(6-49)
testing/python/language/test_tilelang_language_unroll.py (1)
tilelang/language/ast/ir.py (1)
target(1677-1707)
src/tl_templates/hip/reduce.h (1)
src/tl_templates/hip/common.h (13)
tl(110-200)shfl_down(138-140)shfl_down(157-161)shfl_down(182-186)shfl_xor(134-136)shfl_xor(151-155)shfl_xor(176-180)shfl(146-148)shfl(169-173)shfl(194-198)shfl_up(142-144)shfl_up(163-167)shfl_up(188-192)
testing/python/jit/test_tilelang_jit_callback.py (2)
tilelang/engine/callback.py (2)
register_cuda_postproc_callback(44-74)register_hip_postproc_callback(77-107)tilelang/language/v2/dtypes.py (1)
float32(300-300)
testing/python/language/test_tilelang_language_composable_index.py (1)
tilelang/language/ast/ir.py (1)
target(1677-1707)
testing/python/jit/test_tilelang_jit_gemm.py (1)
tilelang/language/v2/dtypes.py (1)
float32(300-300)
src/target/codegen_hip.cc (1)
tilelang/language/reduce_op.py (5)
warp_reduce_sum(387-400)warp_reduce_max(403-416)warp_reduce_min(419-432)warp_reduce_bitand(435-448)warp_reduce_bitor(451-464)
testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py (2)
testing/python/amd/test_tilelang_test_amd.py (1)
run_gemm_rs(195-239)maint/gemm_v2/correctness_evaluation.py (2)
run_gemm_rs(178-210)run_gemm_sr(262-295)
testing/python/carver/test_tilelang_carver_cuda_driver_properties.py (1)
tilelang/carver/arch/driver/cuda_driver.py (7)
get_cuda_device_properties(22-30)get_device_name(33-36)get_shared_memory_per_block(39-52)get_device_attribute(55-78)get_persisting_l2_cache_max_size(97-99)get_num_sms(102-118)get_registers_per_block(121-129)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
🔇 Additional comments (57)
testing/python/language/test_tilelang_language_annotate_safe_value.py (1)
45-47: LGTM!The
@tilelang.testing.requires_cudadecorator is appropriate here since the test explicitly targets CUDA (target="cuda"on line 32) and allocates tensors on the CUDA device (device="cuda"on line 34). This ensures the test is skipped gracefully in environments without CUDA support, aligning with the broader test-gating pattern in this PR.testing/python/jit/test_tilelang_jit_gemm_cython.py (1)
169-169: LGTM! Accumulation dtype upgraded to float32 for improved numerical stability.The change from
T.float16toT.float32for the accumulation dtype across all GEMM tests is appropriate. Using higher-precision accumulation while keeping float16 inputs/outputs is a common practice that improves numerical accuracy without significantly impacting memory bandwidth, especially important for ROCm/HIP support.Also applies to: 211-211, 255-255, 304-304, 306-306, 308-308, 357-357
testing/python/jit/test_tilelang_jit_gemm.py (1)
106-106: LGTM! Test name and accumulation dtype updated consistently.The function name change from
test_gemm_f16f16f16_nn_kernel_jittotest_gemm_f16f16f32_nn_kernel_jitcorrectly reflects the accumulation dtype upgrade toT.float32. This maintains naming consistency and improves test clarity.Also applies to: 115-115
testing/python/jit/test_tilelang_jit_cutedsl.py (1)
158-158: LGTM! CUDA gating correctly applied.The
@tilelang.testing.requires_cudadecorators correctly gate these CuTeDSL tests to CUDA-enabled environments, consistent with the broader test gating pattern across the PR.Also applies to: 210-210, 253-253, 303-303
testing/python/autotune/test_tilelang_autotune.py (1)
263-263: LGTM! CUDA gating correctly applied.The
@tilelang.testing.requires_cudadecorators correctly gate these autotuning tests to CUDA-enabled environments, consistent with the broader test gating pattern across the PR.Also applies to: 269-269
testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py (1)
156-156: LGTM! CUDA gating correctly applied.The
@tilelang.testing.requires_cudadecorators correctly gate these sparse GEMM v2 tests to CUDA-enabled environments, consistent with the broader test gating pattern across the PR.Also applies to: 308-308, 459-459, 614-614
testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py (1)
114-118: LGTM! Improved numerical stability with float32 accumulation.The change from
T.float16toT.float32fordtypeAccumin multiple test parameterizations improves numerical accuracy for float16 GEMM operations by using higher-precision accumulation.Also applies to: 271-275, 427-431, 590-593
testing/python/jit/test_tilelang_jit_nvrtc.py (1)
158-158: LGTM! CUDA gating correctly applied.The
@tilelang.testing.requires_cudadecorators correctly gate these NVRTC backend tests to CUDA-enabled environments, consistent with the broader test gating pattern across the PR.Also applies to: 210-210, 253-253, 303-303, 375-375, 389-389
testing/python/language/test_tilelang_language_clear.py (1)
44-44: LGTM! Target change enables automatic backend selection.The change from
target="cuda"totarget="auto"aligns with the PR objectives to support ROCm and automatic backend detection.testing/python/language/test_tilelang_language_copy.py (4)
32-32: LGTM! Target change enables automatic backend selection.The change from
target="cuda"totarget="auto"aligns with the PR objectives to support ROCm and automatic backend detection.
69-69: LGTM! Consistent target change.
134-134: LGTM! Consistent target change.
146-151: Verify target parameter consistency for FP8 tests.The FP8 tests at lines 148 and 167 (in
run_tilelang_copy_fp8_e8m0andrun_tilelang_copy_fp4) don't specify atargetparameter intilelang.compile(), unlike the other test functions in this file that now usetarget="auto". Confirm whether this omission is intentional or if these should also specifytarget="auto"for consistency.testing/python/debug/test_device_assert.py (1)
16-16: LGTM! Target changes enable automatic backend selection.The changes from
target="cuda"totarget="auto"at lines 16 and 28 align with the PR objectives.testing/python/language/test_tilelang_language_mask_op.py (1)
32-32: LGTM! Target changes enable automatic backend selection.All four occurrences (lines 32, 68, 105, 141) consistently change from
target="cuda"totarget="auto", aligning with the PR objectives to support ROCm and automatic backend detection.testing/python/language/test_tilelang_language_unroll.py (2)
16-16: LGTM! Target change enables automatic backend selection.
20-33: New CUDA-specific unroll test added.The new test correctly verifies unroll pragma generation with an unroll factor. The combination of
@tilelang.testing.requires_cudadecorator andtarget="auto"is appropriate here: the decorator ensures the test runs only when CUDA is available (since unroll_factor is not supported on HIP per the TODO), whiletarget="auto"allows the compilation to auto-detect the backend within that CUDA environment.testing/python/kernel/test_tilelang_kernel_gemm_simt.py (3)
157-157: Correct bug fix: reference should use out_dtype.The kernel returns a tensor of type
out_dtype(see line 81), notaccum_dtype. This change correctly aligns the reference computation with the actual kernel output type.
168-170: Good addition: int8 test with proper CUDA gating.The new test adds coverage for the int8 data path with int32 accumulation, and the
@tilelang.testing.requires_cudadecorator appropriately gates the test for CUDA availability.
164-164: No action needed. The test at line 164 was written withT.float32accumulation as part of the original file creation, not as a change fromT.float16. Additionally, test coverage for float16 accumulation already exists intesting/python/kernel/test_tilelang_kernel_gemm_mma_intrinsic.py(line 222), which usesassert_tl_matmul_correctness(128, 128, 128, T.float16, T.float16, T.float16).Likely an incorrect or invalid review comment.
testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py (2)
199-199: Appropriate CUDA gating for MMA-based test.The
@tilelang.testing.requires_cudadecorator correctly gates this test, which relies on CUDA tensors and tensor core MMA operations.
403-403: Appropriate CUDA gating added to existing decorators.The additional
@tilelang.testing.requires_cudadecorator properly gates this test for CUDA availability, complementing the existing package and LLVM requirements.testing/python/components/test_storage_rewrite_detect_inplace.py (2)
4-6: No issues found. Thecheck_hip_availabilityfunction exists intilelang/utils.target, returns a boolean as expected, and is imported and used correctly in the test file. The module-level initialization of_IS_HIP_AVAILABLEis appropriate for gating platform-specific test assertions.
60-62: The test already verifies patterns against actual generated kernel source.The assertions check the pattern count directly in the kernel source returned by
kernel.get_kernel_source(), so the HIP and CUDA patterns are automatically verified against actual codegen output at test runtime. Pattern correctness cannot be confirmed through static codebase analysis; it will be validated when the test executes.testing/python/carver/test_tilelang_carver_recommend_hints.py (1)
136-136: Clarify the intent behind selective CUDA gating for FMHA test.The
@tilelang.testing.requires_cudadecorator is applied only totest_fmha_recommend_hints, while the other four test functions (test_general_reduction_recommend_hints,test_elementwise_recommend_hints,test_matmul_recommend_hints,test_gemv_recommend_hints) do not have it. All tests useauto_infer_current_arch()and carver templates.Notably, the FlashAttentionTemplate implementation contains no obvious CUDA-specific code—it uses generic TVM operations like
te.placeholder()andte.compute(). Therequires_cudadecorator is imported from TVM's testing utilities and checks for CUDA availability at runtime.Given the PR context ("Open Rocm ci test and fix some bugs"), this selective gating may be intentional (e.g., FMHA not yet validated on ROCm), but this should be documented if it's a deliberate restriction rather than an oversight.
testing/python/debug/test_tilelang_debug_print.py (2)
2-2: LGTM: pytest import for parametrization.The import is necessary for the
pytest.mark.parametrizedecorator added below and follows standard testing practices.
20-24: Good refactor: Parametrized dtype testing.The parametrization consolidates testing across 12 dtypes and appropriately separates FP8 types into hardware-specific tests (CUDA-only and ROCm-only variants). Bfloat16 and other parameterized dtypes are widely supported across both CUDA and ROCm platforms.
src/op/logical.cc (2)
53-54: Verify HIP backend support for the external call.Similar to
tl.any_of, the HIP intrinsic registration fortl.all_ofreuses the lowering function that generates acall_externto"tl::All". Ensure that the HIP backend/runtime provides an implementation for this external symbol.The verification script provided in the previous comment will also check for
"tl::All"implementations.
45-46: HIP backend support verified for external calls.The HIP intrinsic registrations at lines 45-46 and 53-54 are correct. Both
tl::Anyandtl::Allimplementations exist insrc/tl_templates/hip/common.hwith identical logic to their CUDA counterparts insrc/tl_templates/cuda/common.h, confirming that thecall_externreferences will resolve properly at compile time.src/target/codegen_hip.cc (2)
945-947: LGTM! FP8 e5m2 dtype mappings added for MFMA.The mappings for
float8_e5m2fnuzx4andfloat8_e5m2fnuzx8are consistent with the existing e4m3 variants, correctly mapping tofp8_e5_4_tandlongrespectively.
985-994: LGTM! Warp reduction intrinsics for HIP.The codegen correctly emits calls to
tl::warp_reduce_*functions that correspond to the Python intrinsics defined intilelang/language/reduce_op.py. The pattern is consistent with other intrinsic handling in this file.tilelang/engine/lower.py (3)
13-13: LGTM! Centralized path constants.Importing path constants from
tilelang.envcentralizes configuration and removes duplicated path resolution logic.
71-75: LGTM! Clean path usage for CUDA compilation.Using the centralized
TILELANG_TEMPLATE_PATHandCUTLASS_INCLUDE_DIRconstants simplifies the include path configuration.
118-122: LGTM! Clean path usage for HIP compilation.Consistently uses
TILELANG_TEMPLATE_PATHandCOMPOSABLE_KERNEL_INCLUDE_DIRfor HIP compilation, mirroring the CUDA path pattern.src/tl_templates/cuda/reduce.h (2)
177-179: LGTM! CumSum2D signature and SEG parameterization.The return type change to
voidis appropriate since the function writes results todstrather than returning a value. The configurableSEGtemplate parameter (defaulting to 32 for CUDA warp size) provides flexibility while maintaining sensible defaults.
185-191: LGTM! Improved thread/lane calculation and boundary check.The
lane = tid % SEGandrow = tid / SEGpattern allows proper mapping to 2D positions with configurable segment size. The early return ongRow >= Hcorrectly constrains processing to valid rows.src/tl_templates/hip/reduce.h (4)
76-77: LGTM! Type-safe shuffle wrapper usage.Correctly uses
tl::shfl_downinstead of the raw HIP intrinsic, ensuring proper handling ofhalf_tandbfloat16_ttypes through the specializations incommon.h.
107-108: LGTM! Type-safe shuffle wrapper in AllReduce.Uses
tl::shfl_xorfor consistent type-safe shuffling across all data types.
117-182: LGTM! CumSum1D implementation for HIP.The 1D cumulative sum correctly uses
tl::shfl_down,tl::shfl_up, andtl::shflwrappers throughout. TheSEG=64default is appropriate for AMD wavefront size.
184-260: LGTM! CumSum2D implementation for HIP.The 2D cumulative sum mirrors the CUDA version with appropriate use of
tl::shfl_*wrappers andSEG=64default for AMD wavefront size. The Axis-driven indexing logic is consistent with the CUDA implementation.src/tl_templates/hip/common.h (4)
3-5: LGTM! New includes for modular headers.Adding
atomic.h(for relocated atomic operations) andamd_warp_functions.h(for warp intrinsics) properly modularizes the codebase.
110-130: LGTM! Any/All utility functions.Simple and correct implementations of warp-level any/all predicates.
132-148: LGTM! Generic shuffle wrapper templates.The generic templates provide a clean abstraction over HIP's raw
__shfl_*intrinsics.
150-198: LGTM! Type-safe shuffle specializations for half/bfloat16.The specializations for
half_tandbfloat16_tcorrectly convert tofloatbefore shuffling and back after, ensuring proper warp communication for these 16-bit types that HIP intrinsics don't natively support.src/tl_templates/hip/debug.h (5)
6-21: LGTM! Clean trait-based debug printing architecture.The
PrintTraits<T>primary template provides a sensible fallback that prints the address for unknown types, while the macro-generated specializations handle known types with proper format specifiers.
23-38: LGTM! Macro for generating print traits.The
DEFINE_PRINT_TRAITmacro eliminates boilerplate for each type specialization while maintaining consistent formatting.
56-57: LGTM! FP8 type support in debug printing.Adding
fp8_e4_tandfp8_e5_ttraits ensures these new types can be properly debug-printed with float conversion.
76-90: LGTM! Generic pointer specialization.The partial specialization
PrintTraits<T*>handles all pointer types uniformly, correctly using%pformat specifier.
92-100: LGTM! Simplified debug print functions.The public
debug_print_varanddebug_print_buffer_valuefunctions now simply delegate toPrintTraits<T>, keeping the interface clean while the trait system handles type dispatch.testing/python/language/test_tilelang_language_alloc.py (1)
36-36: The relaxed assertions at lines 36 and 76 are intentional and correctly designed. These tests validate a basic feature that works on both CUDA and ROCm backends, and both code generation patterns ("tmp ="and"tmp[0] =") are valid outputs. The tests that lack ROCm support—test_alloc_var_with_initializerandtest_alloc_multi_vars_with_initializer—are already properly gated with@tilelang.testing.requires_cudadecorators and TODO comments. No changes needed.Likely an incorrect or invalid review comment.
testing/python/issue/test_tilelang_issue_1008.py (1)
42-46: LGTM. The@tilelang.testing.requires_cudadecorator is correctly applied to gate these tests for CUDA availability, matching the hardcodeddevice="cuda"on lines 45 and 52, which prevents failures in ROCm-only or non-CUDA CI environments.testing/python/jit/test_tilelang_jit_callback.py (4)
4-4: LGTM: HIP callback import added for multi-platform support.The import correctly adds
register_hip_postproc_callbackto support HIP/ROCm platforms alongside CUDA.
93-96: LGTM: HIP callback implementation correctly mirrors CUDA callback.The HIP post-processing callback follows the same pattern as the CUDA callback (lines 88-91), enabling multi-platform support. Both callbacks being registered within the same function scope is correct—only the relevant callback will be invoked based on the compilation target.
232-232: LGTM: Consistent accumulation dtype change.This change matches the adjustment on line 117, ensuring consistent use of
T.float32for accumulation across both test functions.
117-117: Accumulation dtype change is consistent with codebase patterns.The change to
T.float32for accumulation aligns with all similar tests across the codebase (test_tilelang_jit_tvm_ffi.py,test_tilelang_jit_nvrtc.py,test_tilelang_jit_nullptr.py, andtest_tilelang_jit_parcompile.pyall useaccum_dtype=T.float32). This follows numerical best practices for matrix operations.testing/python/language/test_tilelang_language_infinity.py (2)
3-3: LGTM!The import is properly used for the test decorator and test runner.
33-33: LGTM!Standard test runner invocation is correct.
testing/python/language/test_tilelang_language_composable_index.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
testing/python/cache/test_tilelang_kernel_cache.py (1)
250-250: Same verification needed as test_cache_miss_detection.This test also only compiles without GPU execution. Verify whether CUDA-specific gating is necessary for cache isolation testing, especially in the context of enabling ROCm CI support.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
testing/python/cache/test_tilelang_kernel_cache.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.
Applied to files:
testing/python/cache/test_tilelang_kernel_cache.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
- GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
- GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
🔇 Additional comments (1)
testing/python/cache/test_tilelang_kernel_cache.py (1)
199-199: The @requires_cuda decorator on this test is appropriate. All backends tested (tvm_ffi, cython, nvrtc, cutedsl) are CUDA-based, as stated in the test docstring. There are no ROCm/HIP variants or backends in this test file, unlike other test files that define separate@requires_rocmtests for alternative backends. The decorator correctly reflects the actual dependencies.Likely an incorrect or invalid review comment.
| return cache_dir | ||
|
|
||
|
|
||
| @tilelang.testing.requires_cuda |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
# First, let's look at the test file structure around line 121
head -135 testing/python/cache/test_tilelang_kernel_cache.py | tail -30Repository: tile-ai/tilelang
Length of output: 1029
🏁 Script executed:
# Check the test functions and decorator additions mentioned
rg -n "@tilelang.testing.requires_cuda" testing/python/cache/test_tilelang_kernel_cache.pyRepository: tile-ai/tilelang
Length of output: 167
🏁 Script executed:
# Look for backend definitions and usage
rg -n "BACKENDS" testing/python/cache/test_tilelang_kernel_cache.py -A 10Repository: tile-ai/tilelang
Length of output: 1831
🏁 Script executed:
# Check for ROCm/HIP support in the codebase
rg -n "rocm|hip|amdgpu" --type=py | head -30Repository: tile-ai/tilelang
Length of output: 2890
🏁 Script executed:
# Check the requires_cuda decorator implementation
find . -name "*.py" -type f | xargs grep -l "requires_cuda" | head -10Repository: tile-ai/tilelang
Length of output: 629
🏁 Script executed:
# Look for decorator implementation
rg -n "def requires_cuda|requires_cuda.*=" --type=pyRepository: tile-ai/tilelang
Length of output: 1259
🏁 Script executed:
# Look at test_disk_cache_with_postproc implementation
sed -n '121,200p' testing/python/cache/test_tilelang_kernel_cache.pyRepository: tile-ai/tilelang
Length of output: 2835
🏁 Script executed:
# Look at test_cache_miss_detection implementation
sed -n '199,250p' testing/python/cache/test_tilelang_kernel_cache.pyRepository: tile-ai/tilelang
Length of output: 1816
🏁 Script executed:
# Look at test_cache_isolation_between_tests implementation
sed -n '250,300p' testing/python/cache/test_tilelang_kernel_cache.pyRepository: tile-ai/tilelang
Length of output: 1474
🏁 Script executed:
# Check if there are .cuda() calls in the test file
rg -n "\.cuda\(\)" testing/python/cache/test_tilelang_kernel_cache.pyRepository: tile-ai/tilelang
Length of output: 175
🏁 Script executed:
# Check for any GPU execution or device calls
rg -n "tvm\.cuda|device|gpu|\.to_device" testing/python/cache/test_tilelang_kernel_cache.pyRepository: tile-ai/tilelang
Length of output: 144
🏁 Script executed:
# Check if tilelang.compile requires CUDA or can work without it
rg -n "def compile" tilelang --type=py -A 5 | head -40Repository: tile-ai/tilelang
Length of output: 2784
🏁 Script executed:
# Check if the backends can compile without GPU
rg -n "execution_backend" tilelang --type=py | head -20Repository: tile-ai/tilelang
Length of output: 1717
🏁 Script executed:
# Check if there are other tests that compile without requires_cuda
rg -B 3 "tilelang.compile" testing --type=py | grep -A 3 "def test_" | head -30Repository: tile-ai/tilelang
Length of output: 42
Two tests should not have the @tilelang.testing.requires_cuda decorator.
test_disk_cache_with_postproc correctly has the decorator since it executes kernels on GPU (lines 187-188). However, test_cache_miss_detection and test_cache_isolation_between_tests only call tilelang.compile() without any GPU execution and should not be gated to CUDA-only.
Given the PR objective to "Open Rocm ci test", decorating compilation-only tests with requires_cuda will skip them entirely on ROCm CI. Remove the @tilelang.testing.requires_cuda decorator from test_cache_miss_detection (line 199) and test_cache_isolation_between_tests (line 250) so these tests run on all platforms.
🤖 Prompt for AI Agents
testing/python/cache/test_tilelang_kernel_cache.py lines 199 and 250: the tests
`test_cache_miss_detection` (line 199) and `test_cache_isolation_between_tests`
(line 250) are compilation-only and should not be CUDA-gated; remove the
`@tilelang.testing.requires_cuda` decorator from both test definitions so they
run on all platforms (leave the decorator on `test_disk_cache_with_postproc`
which runs GPU kernels).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
testing/python/language/test_tilelang_language_vectorize.py (2)
51-57: Other CUDA-dependent tests lack therequires_cudadecorator.For consistency with the gating applied to
test_vectorize_all_dtypes, consider also decoratingtest_vectorizeandtest_vectorize_invariant_indexwith@tilelang.testing.requires_cuda, as they also create CUDA tensors (viarun_vectorizeandrun_vectorize_invariant_index).🔎 Suggested change
+@tilelang.testing.requires_cuda def test_vectorize(): N, M = 512, 256+@tilelang.testing.requires_cuda def test_vectorize_invariant_index(): N, M = 512, 256
127-148: Consider adding an assertion to verify kernel output.The decorator addition for CUDA gating is appropriate. However, the test only verifies that the kernel executes without error—it doesn't assert that the output tensor
xcontains the expected values (i.e.,[1, 2, 3, ..., vec_num, 0, 0, ...]cast to the given dtype).Additionally,
torch.float8_e8m0fnu(line 141) was introduced in PyTorch 2.7+ and may not be available in all environments.🔎 Suggested assertion example
def test_vectorize_all_dtypes(dtype, vec_num): x = torch.empty((64,), dtype=dtype, device="cuda") kernel = vectorize_test_all_dtypes(dtype, vec_num) kernel(x) + # Verify first vec_num elements contain expected values + expected = torch.arange(1, vec_num + 1, dtype=dtype, device="cuda") + torch.testing.assert_close(x[:vec_num], expected)
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
testing/python/language/test_tilelang_language_vectorize.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.
Applied to files:
testing/python/language/test_tilelang_language_vectorize.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
- GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
- GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
LeiWang1999
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM, left some comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
testing/python/language/test_tilelang_language_var_init.py (1)
17-17: Fix comment typo.The comment references variable
c, but the actual variable name isd.🔎 Proposed fix
- d = T.alloc_var(T.int32, init=a) # c gets new value of a + d = T.alloc_var(T.int32, init=a) # d gets new value of a
♻️ Duplicate comments (6)
testing/python/cache/test_tilelang_kernel_cache.py (2)
199-199: Remove CUDA-gating from compilation-only test.
test_cache_miss_detectiononly callstilelang.compile()without any GPU execution. Gating it with@tilelang.testing.requires_cudawill skip it entirely on ROCm CI, contradicting the PR objective to "Open Rocm ci test." Remove this decorator so the test runs on all platforms including ROCm.🔎 Proposed fix
-@tilelang.testing.requires_cuda @pytest.mark.parametrize("backend", BACKENDS) def test_cache_miss_detection(clean_cache_env, backend):
250-250: Remove CUDA-gating from compilation-only test.
test_cache_isolation_between_testsonly callstilelang.compile()without any GPU execution. Gating it with@tilelang.testing.requires_cudawill skip it entirely on ROCm CI, contradicting the PR objective to "Open Rocm ci test." Remove this decorator so the test runs on all platforms including ROCm.🔎 Proposed fix
-@tilelang.testing.requires_cuda @pytest.mark.parametrize("backend", BACKENDS) def test_cache_isolation_between_tests(clean_cache_env, backend):testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py (2)
133-156: Add tracking issue references to skip decorators.The FP8 GEMM SS tests are skipped with a generic reason. Per previous review feedback, create tracking issues for the known GEMM SS problems and reference them in the skip decorators for traceability.
Example:
@pytest.mark.skip(reason="Temporarily disabling until GEMM SS issues are resolved. See issue #XXXX")
290-313: Add tracking issue references to skip decorators.Similar to the SS variant, these FP8 GEMM RS tests need tracking issue references in their skip decorators for better traceability.
testing/python/language/test_tilelang_language_atomic_add.py (1)
248-254: Add missing@tilelang.testing.requires_cudadecorator for consistency.
test_atomic_addx2_half(Line 248) is decorated with@tilelang.testing.requires_cuda, buttest_atomic_addx2_float(Line 253) is not. Both tests use the same underlyingatomic_addx2operation, which is CUDA-specific. This inconsistency may causetest_atomic_addx2_floatto fail on non-CUDA platforms.🔎 Proposed fix
+@tilelang.testing.requires_cuda def test_atomic_addx2_float(): run_atomic_addx2(32, 64, 8, 16, dtype=T.float32)src/tl_templates/hip/reduce.h (1)
262-271: Usetl::shfl_xorinstead of raw__shfl_xorfor type safety.The
warp_reducefunction uses raw HIP intrinsics (__shfl_xor) which will produce incorrect results forhalf_tandbfloat16_ttypes. These types require thetl::shfl_xorwrapper (defined incommon.hlines 134-180) that converts to float before shuffling.This is inconsistent with the rest of the file where
tl::shfl_*wrappers are correctly used (e.g., lines 76, 107, 142, 165, etc.).🔎 Proposed fix
template <typename T, typename ReduceOp> TL_DEVICE T warp_reduce(T value, ReduceOp op) { - value = op(value, __shfl_xor(value, 32)); - value = op(value, __shfl_xor(value, 16)); - value = op(value, __shfl_xor(value, 8)); - value = op(value, __shfl_xor(value, 4)); - value = op(value, __shfl_xor(value, 2)); - value = op(value, __shfl_xor(value, 1)); + value = op(value, tl::shfl_xor(value, 32)); + value = op(value, tl::shfl_xor(value, 16)); + value = op(value, tl::shfl_xor(value, 8)); + value = op(value, tl::shfl_xor(value, 4)); + value = op(value, tl::shfl_xor(value, 2)); + value = op(value, tl::shfl_xor(value, 1)); return value; }
🧹 Nitpick comments (8)
testing/python/jit/test_tilelang_jit_tvm_ffi.py (1)
373-378: Consider adding@tilelang.testing.requires_cudadecorator for consistency.The test already has runtime Hopper GPU checking, but adding the
@tilelang.testing.requires_cudadecorator would be more consistent withtest_tvm_ffi_l2_persistent_mapsince TMA (Tensor Memory Accelerator) is CUDA-specific architecture.🔎 Proposed addition
+@tilelang.testing.requires_cuda def test_tvm_ffi_im2col_tma_desc(): """Test im2col TMA descriptor with tvm_ffi backend.""" if not check_hopper():testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py (4)
432-441: Convert commented-out tests to explicit skip decorators.The commented-out test cases with TODO notes about ROCm precision problems reduce visibility and make systematic re-enabling harder. Consider:
- Creating tracking issues for the precision problems
- Uncommenting the test cases and marking them with
@pytest.mark.skipdecorators that reference the issues- Adding platform-specific skip conditions if needed
This approach improves traceability and makes it easier to re-enable tests once issues are resolved.
Example refactor
@tilelang.testing.requires_rocm @pytest.mark.skip(reason="ROCm precision problem for int8 SR variant. See issue #XXXX") @pytest.mark.parametrize(...) def test_gemm_sr_int8_rocm_precision(...): ...
459-470: Use explicit skip decorator for precision-problematic test case.Line 464 has a commented-out
float8_e5m2fnuztest with a TODO about precision problems. Similar to the earlier recommendation, use an explicit skip decorator with a tracking issue reference instead of commenting out the test.
595-597: Convert commented-out tests to explicit skip decorators.Similar to Lines 432-441, these commented-out test cases should be converted to explicit skip decorators with tracking issue references for better visibility and traceability.
623-634: Use explicit skip decorator for precision-problematic test case.Line 628 has a commented-out
float8_e5m2fnuztest case. Use an explicit skip decorator with tracking issue reference instead.testing/python/language/test_tilelang_language_atomic_add.py (1)
198-200: Consider aligning the default dtype with other atomic functions. While the dtype parameterization is correct, the default value `T.float16` differs from other atomic operation functions in this file (e.g., `atomic_add_program`, `atomic_max_program`), which default to `T.float32`. This inconsistency might cause confusion, though it could be intentional given that `atomic_addx2` is typically used with FP16 pairs.
🔎 Proposed alignment
```diff
-def atomic_addx2_program(M, N, block_M, block_N, dtype=T.float16):
+def atomic_addx2_program(M, N, block_M, block_N, dtype=T.float32):
```
And similarly for `run_atomic_addx2`:
```diff
-def run_atomic_addx2(M, N, block_M, block_N, dtype=T.float16):
+def run_atomic_addx2(M, N, block_M, block_N, dtype=T.float32):
```
testing/python/language/test_tilelang_language_ptr.py (1)
49-52: Consider adding a CUDA availability check. The test allocates tensors on the CUDA device without checking whether CUDA is available. If this test is intended to run only on CUDA-capable systems, consider adding the `@tilelang.testing.requires_cuda` decorator to `test_matmul()` (line 61), consistent with the gating pattern mentioned in the PR summary.
🔎 Proposed fix to add CUDA gating
Add the decorator to the test function:
```diff
+@tilelang.testing.requires_cuda
 def test_matmul():
     run_matmul(1024, 1024, 1024, 128, 128, 32)
```
src/tl_templates/hip/atomic.h (1)
1-104: Acknowledge the memory_order parameter limitation. All atomic functions accept a `memory_order` parameter but ignore it, as documented in the comments. While a past review flagged this inconsistency with CUDA (which conditionally applies memory ordering), it was marked as addressed in commits 14067c3 to d592f8b. The current approach, keeping the parameter for lowering compatibility while ignoring it, is pragmatic if HIP lacks equivalent memory ordering semantics. However, consider documenting at the file level why HIP cannot support explicit memory ordering (e.g., lack of equivalent HIP atomic variants or fence semantics) to prevent future confusion.
📝 Suggested file-level documentation
Add a comment at the top of the file explaining the memory_order limitation:
```diff
 #pragma once
 #include <hip/hip_runtime.h>
+
+// Note: All atomic functions accept a memory_order parameter to match the
+// lowering interface, but HIP atomics do not support explicit memory ordering
+// semantics equivalent to CUDA's acquire/release/seq_cst variants. The parameter
+// is ignored and all operations use HIP's default atomic behavior. Use explicit
+// fences (__threadfence_block, __threadfence, __threadfence_system) if stronger
+// ordering guarantees are required.

 // Add an extra unused input to accommodate the additional 'memory_order'
```
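To illustrate the shape of the interface being discussed, here is a minimal sketch of an add wrapper that accepts and ignores the ordering argument (an illustration only, not the file's actual code):
```cpp
#include <hip/hip_runtime.h>

// The extra int parameter only mirrors the lowering interface; HIP's atomicAdd
// has no per-call memory-order variants, so the argument is ignored.
template <typename T>
__device__ inline T AtomicAddIgnoreOrder(T *address, T value, int memory_order = 0) {
  (void)memory_order;
  return atomicAdd(address, value);
}
```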
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (54)
- .github/workflows/ci.yml
- src/op/gemm.cc
- src/op/logical.cc
- src/target/codegen_hip.cc
- src/tl_templates/cuda/reduce.h
- src/tl_templates/hip/atomic.h
- src/tl_templates/hip/common.h
- src/tl_templates/hip/debug.h
- src/tl_templates/hip/hip_fp8.h
- src/tl_templates/hip/reduce.h
- testing/python/autotune/test_tilelang_autotune.py
- testing/python/cache/test_tilelang_kernel_cache.py
- testing/python/carver/test_tilelang_carver_cuda_driver_properties.py
- testing/python/carver/test_tilelang_carver_recommend_hints.py
- testing/python/components/test_storage_rewrite_detect_inplace.py
- testing/python/debug/test_device_assert.py
- testing/python/debug/test_tilelang_debug_print.py
- testing/python/issue/test_tilelang_issue_1001.py
- testing/python/issue/test_tilelang_issue_1008.py
- testing/python/issue/test_tilelang_issue_830.py
- testing/python/issue/test_tilelang_issue_96.py
- testing/python/jit/test_tilelang_jit_callback.py
- testing/python/jit/test_tilelang_jit_cutedsl.py
- testing/python/jit/test_tilelang_jit_gemm.py
- testing/python/jit/test_tilelang_jit_gemm_cython.py
- testing/python/jit/test_tilelang_jit_nvrtc.py
- testing/python/jit/test_tilelang_jit_parcompile.py
- testing/python/jit/test_tilelang_jit_tvm_ffi.py
- testing/python/kernel/test_tilelang_kernel_gemm.py
- testing/python/kernel/test_tilelang_kernel_gemm_simt.py
- testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py
- testing/python/language/test_tilelang_language_alias.py
- testing/python/language/test_tilelang_language_alloc.py
- testing/python/language/test_tilelang_language_annot.py
- testing/python/language/test_tilelang_language_annotate_safe_value.py
- testing/python/language/test_tilelang_language_atomic_add.py
- testing/python/language/test_tilelang_language_clear.py
- testing/python/language/test_tilelang_language_composable_index.py
- testing/python/language/test_tilelang_language_copy.py
- testing/python/language/test_tilelang_language_frontend_v2.py
- testing/python/language/test_tilelang_language_infinity.py
- testing/python/language/test_tilelang_language_let.py
- testing/python/language/test_tilelang_language_mask_op.py
- testing/python/language/test_tilelang_language_ptr.py
- testing/python/language/test_tilelang_language_unroll.py
- testing/python/language/test_tilelang_language_var_init.py
- testing/python/language/test_tilelang_language_vectorize.py
- testing/python/language/test_tilelang_language_vectorized_cast.py
- testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py
- testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py
- tilelang/engine/lower.py
- tilelang/intrinsics/mfma_layout.py
- tilelang/intrinsics/mfma_macro_generator.py
- tilelang/jit/adapter/wrapper.py
💤 Files with no reviewable changes (3)
- testing/python/language/test_tilelang_language_composable_index.py
- testing/python/language/test_tilelang_language_copy.py
- testing/python/issue/test_tilelang_issue_830.py
🚧 Files skipped from review as they are similar to previous changes (21)
- src/op/gemm.cc
- testing/python/language/test_tilelang_language_clear.py
- testing/python/language/test_tilelang_language_mask_op.py
- testing/python/language/test_tilelang_language_alias.py
- testing/python/jit/test_tilelang_jit_gemm.py
- tilelang/engine/lower.py
- testing/python/carver/test_tilelang_carver_cuda_driver_properties.py
- testing/python/components/test_storage_rewrite_detect_inplace.py
- testing/python/debug/test_device_assert.py
- testing/python/issue/test_tilelang_issue_1001.py
- testing/python/autotune/test_tilelang_autotune.py
- testing/python/kernel/test_tilelang_kernel_gemm_simt.py
- src/tl_templates/hip/debug.h
- testing/python/language/test_tilelang_language_annotate_safe_value.py
- testing/python/language/test_tilelang_language_vectorized_cast.py
- .github/workflows/ci.yml
- testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py
- testing/python/kernel/test_tilelang_kernel_gemm.py
- testing/python/language/test_tilelang_language_annot.py
- testing/python/issue/test_tilelang_issue_1008.py
- testing/python/language/test_tilelang_language_unroll.py
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.
Applied to files:
testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py
testing/python/carver/test_tilelang_carver_recommend_hints.py
testing/python/jit/test_tilelang_jit_callback.py
testing/python/jit/test_tilelang_jit_parcompile.py
testing/python/language/test_tilelang_language_vectorize.py
testing/python/jit/test_tilelang_jit_tvm_ffi.py
testing/python/language/test_tilelang_language_var_init.py
testing/python/jit/test_tilelang_jit_cutedsl.py
testing/python/language/test_tilelang_language_frontend_v2.py
testing/python/cache/test_tilelang_kernel_cache.py
testing/python/language/test_tilelang_language_ptr.py
testing/python/jit/test_tilelang_jit_gemm_cython.py
testing/python/issue/test_tilelang_issue_96.py
testing/python/language/test_tilelang_language_let.py
testing/python/language/test_tilelang_language_alloc.py
testing/python/jit/test_tilelang_jit_nvrtc.py
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.
Applied to files:
testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py
testing/python/jit/test_tilelang_jit_parcompile.py
testing/python/language/test_tilelang_language_infinity.py
testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py
testing/python/language/test_tilelang_language_atomic_add.py
testing/python/issue/test_tilelang_issue_96.py
testing/python/language/test_tilelang_language_alloc.py
📚 Learning: 2025-09-15T10:51:06.985Z
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). However, the layout template parameter should use sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h where Crosswise uses sizeof(ElementA) but the layout template uses sizeof_bits<ElementA>::value.
Applied to files:
tilelang/intrinsics/mfma_layout.py
src/tl_templates/hip/reduce.h
📚 Learning: 2025-12-26T06:45:51.789Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1483
File: tilelang/jit/adapter/cutedsl/adapter.py:93-95
Timestamp: 2025-12-26T06:45:51.789Z
Learning: For the CuTeDSL backend in tilelang/jit/adapter/cutedsl/adapter.py, the host_kernel_source and device_kernel_source have the same value.
Applied to files:
testing/python/jit/test_tilelang_jit_cutedsl.py
📚 Learning: 2025-09-15T10:51:06.985Z
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h.
Applied to files:
src/tl_templates/hip/reduce.h
📚 Learning: 2025-12-24T17:20:32.819Z
Learnt from: clouds56
Repo: tile-ai/tilelang PR: 1527
File: tilelang/env.py:0-0
Timestamp: 2025-12-24T17:20:32.819Z
Learning: The nvidia-cuda-nvcc PyPI package installs to `nvidia/cu13/bin/` (for CUDA 13), `nvidia/cu12/bin/` (for CUDA 12), and `nvidia/cu11/bin/` (for CUDA 11) in the site-packages directory, not to `nvidia/cuda_nvcc/bin/`. These paths should be used when detecting CUDA installations from PyPI packages in tilelang/env.py.
Applied to files:
testing/python/jit/test_tilelang_jit_nvrtc.py
🧬 Code graph analysis (10)
tilelang/intrinsics/mfma_layout.py (1)
tilelang/language/v2/builder.py (1)
const(833-844)
testing/python/jit/test_tilelang_jit_tvm_ffi.py (2)
tilelang/language/v2/dtypes.py (2)
float32(300-300)
float16(299-299)
tilelang/language/symbolics.py (1)
dynamic(12-29)
src/tl_templates/cuda/reduce.h (1)
src/tl_templates/hip/reduce.h (2)
run(121-181)
run(188-259)
tilelang/intrinsics/mfma_macro_generator.py (1)
tilelang/language/v2/dtypes.py (1)
int8(243-243)
testing/python/debug/test_tilelang_debug_print.py (1)
tilelang/jit/__init__.py (2)
compile(47-107)
compile(347-373)
testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py (5)
tilelang/language/v2/dtypes.py (8)
float16(299-299)
float32(300-300)
float8_e5m2(355-355)
float8_e5m2fnuz(362-362)
float8_e4m3fnuz(348-348)
int8(243-243)
int32(245-245)
bfloat16(397-397)
tilelang/tileop/gemm/gemm_base.py (6)
M(35-36)
N(39-40)
K(43-44)
trans_A(47-48)
trans_B(51-52)
in_dtype(55-57)
tilelang/language/kernel.py (1)
num_threads(221-225)
testing/python/amd/test_tilelang_test_amd.py (1)
run_gemm_rs(195-239)
maint/gemm_v2/correctness_evaluation.py (3)
run_gemm_rs(178-210)
run_gemm_sr(262-295)
run_gemm_rr(350-383)
testing/python/language/test_tilelang_language_atomic_add.py (2)
tilelang/language/atomic.py (1)
atomic_addx2(230-265)
tilelang/language/proxy.py (1)
Tensor(232-232)
src/tl_templates/hip/reduce.h (1)
src/tl_templates/hip/common.h (13)
tl(110-200)
shfl_down(138-140)
shfl_down(157-161)
shfl_down(182-186)
shfl_xor(134-136)
shfl_xor(151-155)
shfl_xor(176-180)
shfl(146-148)
shfl(169-173)
shfl(194-198)
shfl_up(142-144)
shfl_up(163-167)
shfl_up(188-192)
testing/python/issue/test_tilelang_issue_96.py (2)
tilelang/jit/__init__.py (2)
compile(47-107)
compile(347-373)
tilelang/jit/kernel.py (1)
out_idx(609-610)
src/tl_templates/hip/common.h (2)
src/tl_templates/cuda/common.h (5)
tl(248-342)
tl(571-573)
tl(574-576)
tl(591-671)
bfloat16_t(581-582)
src/tl_templates/cuda/atomic.h (1)
half_t(24-26)
🔇 Additional comments (63)
testing/python/language/test_tilelang_language_let.py (1)
6-6: Verify CUDA-only gating aligns with ROCm CI objectives.The PR title indicates the goal is to "Open Rocm ci test," but this decorator restricts the test to CUDA-only environments. Given that line 18 hardcodes
target="cuda"and line 19 checks for CUDA-specific output ("float4 b"), this gating may be intentional if the test exercises CUDA-specific features not yet supported in ROCm.Please confirm whether:
- This test is intentionally excluded from ROCm CI, or
- A ROCm-compatible version of this test should be added alongside this CUDA-gated one.
testing/python/issue/test_tilelang_issue_96.py (1)
40-40: LGTM! Change enables multi-backend support.Removing the explicit
target="cuda"parameter allows the compilation target to be auto-detected or controlled via theTILELANG_TARGETenvironment variable, which is essential for ROCm CI support. Thecompilefunction will default to"auto"when no target is specified, enabling the test to run on both CUDA and ROCm systems.Note: The PyTorch
device="cuda"on lines 43-44 is correct, as PyTorch uses the "cuda" device identifier for both NVIDIA CUDA and AMD ROCm backends.testing/python/language/test_tilelang_language_var_init.py (1)
6-7: Appropriate CUDA gating for unsupported HIP feature.The decorator correctly restricts this test to CUDA environments given that variable initialization is not yet supported on HIP.
testing/python/jit/test_tilelang_jit_cutedsl.py (1)
158-158: LGTM! CUDA gating is appropriate for CuTeDSL backend tests.The
@tilelang.testing.requires_cudadecorator additions are correct for these CuTeDSL-specific tests. Since CuTeDSL is a CUDA backend and all tests invoke.cuda()operations, gating them behind CUDA availability enables ROCm CI to skip these gracefully.Also applies to: 210-210, 253-253, 303-303
testing/python/jit/test_tilelang_jit_nvrtc.py (1)
158-158: LGTM! CUDA gating is appropriate for NVRTC backend tests.The
@tilelang.testing.requires_cudadecorator additions are correct for these NVRTC-specific tests. Since NVRTC (NVIDIA Runtime Compilation) is a CUDA-only backend and all tests invoke.cuda()operations, gating them behind CUDA availability enables ROCm CI to skip these gracefully. The Hopper-specific test at line 375 correctly uses layered gating with bothrequires_cudaandcheck_hopper().Also applies to: 210-210, 253-253, 303-303, 375-375, 389-389
testing/python/cache/test_tilelang_kernel_cache.py (1)
121-121: Correct decorator usage for GPU-executing test.This test appropriately has the
@tilelang.testing.requires_cudadecorator since it executes kernels on GPU tensors (lines 187-188, 190-191).testing/python/jit/test_tilelang_jit_gemm_cython.py (1)
169-169: LGTM: Accumulation dtype upgrade improves numerical stability.Changing the accumulation dtype from
T.float16toT.float32across all GEMM tests is a best practice that reduces accumulated rounding errors during matrix multiplication. This is especially important when using mixed-precision computation (float16 inputs/outputs with float32 accumulation) and aligns with the reference implementation which usestorch.float(float32) for computation.The existing test tolerances (
atol=1e-2, rtol=1e-2) remain appropriate and should actually see improved accuracy with this change.Also applies to: 211-211, 255-255, 304-308, 357-357
testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py (2)
199-199: LGTM! Appropriate CUDA gating for Tensor Core operations.The decorator correctly gates this test since it relies on NVIDIA-specific INT4 Tensor Core MMA operations, ldmatrix/stmatrix intrinsics, and CUDA device tensors.
403-403: LGTM! Correct decorator stacking for multi-requirement test.The CUDA gating decorator is properly stacked with existing
bitblaspackage and LLVM requirements, ensuring the test only runs when all dependencies are available.testing/python/language/test_tilelang_language_alloc.py (4)
36-36: Clarify the acceptance of two variable allocation patterns.The assertion now accepts either
"tmp ="or"tmp[0] ="in the emitted code. This suggests different code generation between backends (likely CUDA vs ROCm).Please clarify:
- Is
"tmp[0] ="an expected difference in how ROCm/HIP represents scalar variables compared to CUDA?- Could this mask a potential code generation issue where scalar variables are incorrectly treated as arrays?
Consider adding a comment explaining why both patterns are valid, or verify that the backend-specific pattern is correct.
The same concern applies to line 76.
118-120: Temporary ROCm gating is appropriate.The explicit CUDA-only gating with a TODO comment is a reasonable approach for incrementally enabling ROCm support. This allows the CI to run on CUDA while tracking tests that need ROCm enablement.
Consider tracking these TODOs in a tracking issue to ensure they're addressed systematically.
154-156: Verify consistency of kernel source retrieval across tests.This test uses
kernel.get_kernel_source(kernel_only=True)while the tests at lines 35 and 75 usekernel.get_kernel_source()without thekernel_onlyparameter.Please clarify:
- Is
kernel_only=Truenecessary here due to the specific counting logic (counting occurrences of"= 1;"and"= 2;")?- Should the earlier tests (lines 35, 75) also use
kernel_only=Truefor more precise assertions?Consistent usage across similar tests improves maintainability.
159-161: Temporary ROCm gating is appropriate.Same approach as the previous gated test. The consistent pattern of TODO comments and CUDA-only gating facilitates incremental ROCm enablement.
testing/python/language/test_tilelang_language_infinity.py (1)
3-3: LGTM! Essential import bug fix.This import was missing despite the code using
tilelang.testing.requires_cuda(line 24) andtilelang.testing.main()(line 33). Without this explicit import, the test would fail with anAttributeErrorat runtime.testing/python/language/test_tilelang_language_vectorize.py (1)
127-127: Verify the intent behind CUDA-only gating in a ROCm enablement PR.The decorator correctly gates this test given the hardcoded
device="cuda"on line 146. However, since this PR aims to enable ROCm CI support, consider whether this test could be updated to support both CUDA and ROCm devices (e.g., parameterizing the device) rather than restricting it to CUDA-only.If there's a technical reason this test must remain CUDA-only (e.g., certain FP8 dtypes tested here have different support on ROCm), please clarify the reasoning.
Optional: Parameterize device for ROCm support
If ROCm support is intended for this test, you could parameterize the device:
@tilelang.testing.requires_cuda @pytest.mark.parametrize( "dtype", [ torch.uint8, ... ], ) @pytest.mark.parametrize("vec_num", [1, 2, 4, 8]) -def test_vectorize_all_dtypes(dtype, vec_num): - x = torch.empty((64,), dtype=dtype, device="cuda") +@pytest.mark.parametrize("device", ["cuda"]) # Add "hip" when ROCm support is ready +def test_vectorize_all_dtypes(dtype, vec_num, device): + x = torch.empty((64,), dtype=dtype, device=device) kernel = vectorize_test_all_dtypes(dtype, vec_num) kernel(x)Then update the decorator to
@tilelang.testing.requires_gpu(if available) instead of CUDA-specific.testing/python/jit/test_tilelang_jit_tvm_ffi.py (4)
167-167: LGTM: Improved numerical stability with float32 accumulation.Switching from float16 to float32 accumulation while keeping float16 input/output is a standard mixed-precision pattern that improves numerical accuracy without significantly impacting memory usage.
210-210: LGTM: Consistent accumulation dtype across test suite.The float32 accumulation dtype change is consistently applied across benchmarking and multi-stream tests, ensuring uniform numerical behavior.
Also applies to: 252-252
301-301: LGTM: Float32 accumulation applied to dynamic shape tests.The accumulation dtype change is correctly applied across all dynamic shape test variants, maintaining consistency with static shape tests.
Also applies to: 303-303, 306-306
386-386: LGTM: Correct CUDA-only gating for L2 persistent cache test.The decorator is appropriate since the test validates CUDA-specific L2 persistent cache APIs (
__tvm_cuda_stream_set_access_policy_window_packed). This feature is not available on ROCm.testing/python/carver/test_tilelang_carver_recommend_hints.py (1)
136-136: Consider whether@tilelang.testing.requires_cudadecorator appropriately gates FMHA tests given ROCm enablement goals.The FlashAttentionTemplate implementation uses architecture-agnostic TVM APIs (
get_tensorized_func_and_tagswithself.arch.target) and the test callsauto_infer_current_arch(), which should work on both CUDA and ROCm. However, this test is CUDA-gated while similar templates in the same file—MatmulTemplate, GEMVTemplate, ElementwiseTemplate, and GeneralReductionTemplate—are not, making the restriction inconsistent.Verify whether:
- FlashAttentionTemplate has runtime or compiler-level ROCm limitations not visible in the template code
- The decorator should be removed or replaced with separate ROCm and CUDA tests following the pattern shown in
test_tilelang_tilelibrary_gemm.pytesting/python/tilelibrary/test_tilelang_tilelibrary_gemm.py (6)
114-118: LGTM! Float32 accumulation improves numerical stability.The addition of
T.float32asdtypeAccumfor float16 tests follows best practices for preventing overflow and maintaining precision during accumulation.
271-275: LGTM! Consistent accumulation dtype usage.The float32 accumulation for float16 inputs is correctly applied to the RS variant tests.
427-431: LGTM! Float32 accumulation correctly applied.
448-456: Verify SR FP8 stability on CUDA.Unlike the SS and RS variants (which are skipped), this SR FP8 CUDA test is enabled. Confirm that the SR variant is stable enough for FP8 on CUDA, or if this should also be skipped until GEMM issues are resolved.
590-593: LGTM! Float32 accumulation correctly applied.
612-620: Verify RR FP8 stability on CUDA.Like the SR variant, this RR FP8 CUDA test is enabled while SS and RS are skipped. Confirm that the RR variant has sufficient stability for FP8 on CUDA.
testing/python/language/test_tilelang_language_frontend_v2.py (1)
322-323: LGTM! Appropriate workaround for ROCm limitation.Gating this test behind CUDA availability is the right approach given that ROCm doesn't currently support
alloc_varwith initializer (line 329 usesT.alloc_var(T.int32, 0)). The TODO comment clearly documents the limitation. Consider tracking this as a formal issue to ensure ROCm support foralloc_varwith initializer is added in the future.src/op/logical.cc (2)
53-54: LGTM! Consistent HIP support addition.The HIP intrinsic registration follows the same pattern as
tl.any_ofabove, correctly reusing the lowering function for both CUDA and HIP backends. This maintains consistency across the logical operations.
45-46: HIP runtime implementations confirmed to exist.The registration correctly adds HIP support alongside CUDA, reusing the same lowering function. Both
tl::Anyandtl::Allare properly implemented as device templates insrc/tl_templates/hip/common.h, ensuring external symbol resolution at link time.testing/python/jit/test_tilelang_jit_callback.py (4)
4-4: LGTM!The import correctly extends callback support to include HIP alongside CUDA.
93-96: LGTM! HIP callback mirrors the CUDA pattern.The implementation correctly follows the CUDA postproc callback structure. Note that like the CUDA callback, this registration will persist globally after the function executes.
Based on learnings, the persistent callback behavior is intentional.
232-232: Verify the accumulation dtype change in the active test.This test actively runs and validates numerical correctness. The change from
T.float16toT.float32for accumulation improves numerical precision but could indicate:
- A bug fix for numerical accuracy issues
- A HIP-specific requirement
- A general best practice change
Please confirm:
- Is this change necessary for HIP support, or is it a general fix?
- Have you verified the test passes with this change on both CUDA and HIP (if applicable)?
- Does this align with the "fix some bugs" mentioned in the PR title?
117-117: No action needed — accumulation dtype change aligns with codebase patterns.The change to
T.float32for accumulation dtype is consistent with standard mixed-precision design used throughout the test suite (float16 inputs with float32 accumulation for numerical stability). This pattern is used in all similar tests and applies to both CUDA and HIP paths equally. Since the test is skipped, this change has no current impact.testing/python/jit/test_tilelang_jit_parcompile.py (1)
61-62: The block_K configurations are intentional and mathematically sound. Both configurations maintain constant shared memory usage (~32KB):
- Config 1 (1024³): block_K=64 → A_shared(128×64) + B_shared(64×128) = 32KB
- Config 2 (2048³): block_K=32 → A_shared(256×32) + B_shared(32×256) = 32KB
As matrix dimensions double, block_K is halved to maintain the shared memory constraint—a critical optimization for ROCm devices. The commit message "Fix some bugs on ci and open rocm ci test" confirms this is a deliberate bugfix, not an accidental swap. No changes needed.
testing/python/language/test_tilelang_language_atomic_add.py (3)
238-240: LGTM!The
@tilelang.testing.requires_cudadecorator appropriately gates this CUDA-specific atomic operation test.
243-245: LGTM!The
@tilelang.testing.requires_cudadecorator correctly gates this CUDA-specific test for atomic operations with memory ordering.
353-357: LGTM!The
@tilelang.testing.requires_cudadecorator properly gates this test, and the addition of multiple dtype tests (float32, float16, bfloat16) improves test coverage for atomic operations with different memory orderings.testing/python/language/test_tilelang_language_ptr.py (1)
44-44: The code is correct and does not require changes. The cython execution backend is explicitly supported for CUDA targets (verified intilelang/jit/execution_backend.py::allowed_backends_for_target()where cuda allows["tvm_ffi", "nvrtc", "cython"]). Target autodetection withdetermine_target("auto")explicitly detects CUDA availability viatorch.cuda.is_available()and returns the appropriate CUDA target with device capability. The CythonKernelAdapter is documented as converting TIR functions to compiled CUDA libraries.Likely an incorrect or invalid review comment.
testing/python/debug/test_tilelang_debug_print.py (3)
2-2: LGTM!The pytest import is necessary for the parameterized test decorator introduced later in the file.
20-24: LGTM!The parameterized test refactoring consolidates multiple dtype test cases into a single, maintainable test. Platform-specific FP8 types are appropriately tested separately with conditional decorators.
50-50: No actionable changes needed; tests properly support platform-agnostic GPU compilation.The removal of explicit
target="cuda"correctly enables ROCm CI testing. The tests contain GPU-specific operations (T.Kernel,T.alloc_shared,T.alloc_fragment) and rely on the defaulttarget="auto", which usesdetermine_target()to automatically select CUDA or HIP based on available hardware. Importantly, if neither CUDA nor HIP is available, the function raisesValueError—tests cannot silently fall back to CPU execution. The change properly aligns with enabling cross-platform GPU testing without requiring conditional decorators.src/tl_templates/hip/hip_fp8.h (1)
1-1: LGTM! Include guard added.The addition of
#pragma oncefollows standard practice for header files and prevents multiple inclusion issues.tilelang/jit/adapter/wrapper.py (3)
314-316: Good refactor to extract declaration parsing logic.Moving the declaration extraction into a dedicated method improves code organization and maintainability.
612-615: Appropriate HIP-specific declaration parsing.The HIP implementation correctly splits on
"{"instead of";"to handle HIP's kernel declaration syntax, which goes directly from signature to function body without a semicolon-terminated forward declaration.
585-585: Both type mappings are actively used and necessary.The additions
"float8_e5m2fnuz": "fp8_e5_t"and"uint64": "uint64_t"are legitimately needed for HIP kernel generation:
- float8_e5m2fnuz is a first-class dtype with extensive kernel support (GEMM kernels, MMA/MFMA intrinsic generators, test coverage with actual computations)
- uint64 is used during kernel argument generation for tensor descriptor dimensions and strides, which are always 64-bit in GPU addressing
Both types are looked up via
_lookup_type()when building kernel function signatures.src/tl_templates/cuda/reduce.h (3)
185-186: Good refactor to parameterize segment size.Replacing hardcoded
32with theSEGtemplate parameter improves flexibility and aligns with the HIP implementation.
190-191: Good addition of boundary check.The early exit
if (gRow >= H) return;prevents out-of-bounds memory access when the global row index exceeds the height dimension.
178-179: The function signature at lines 178-179 showsvoidas the return type, which is consistent withCumSum1D(line 110) and the HIP implementation (line 188-189). This is not a breaking change—the function has always returnedvoid, writing results to thedstparameter. The call site insrc/op/reduce.cc:559correctly handles this by passingdstPtras a parameter and wrapping the result inEvaluate(), which is the correct pattern for void external function calls. No action is required.Likely an incorrect or invalid review comment.
tilelang/intrinsics/mfma_layout.py (1)
3-3: No changes needed—const(0) usage is correct.The change from
convert(0)toconst(0)on lines 9 and 20 is semantically appropriate. These layout functions return TIR expressions, andtvm.tir.const()is the correct API for creating compile-time integer constants in that context. The remainingconvert()call at line 38 correctly usesconvert()for its intended purpose of converting a Python list to a TVM runtime object. No otherconvert(0)patterns exist in the intrinsics code.tilelang/intrinsics/mfma_macro_generator.py (3)
53-53: LGTM! FP8 e5m2fnuz abbreviation mapping is consistent.The addition of
"float8_e5m2fnuz": "e5m2fnuz"follows the same pattern as the existing"float8_e4m3fnuz"mapping.
111-111: LGTM! FP8 e5m2fnuz k_dim handling is correct.The new check correctly treats
"float8_e5m2fnuz"the same as"float8_e4m3fnuz"andT.int8, settingk_dim = 32as expected for 8-bit types in MFMA operations.
145-145: LGTM! FP8 e5m2fnuz MFMA prefix mapping is consistent.The mapping of
"float8_e5m2fnuz": "fp8"correctly groups it with"float8_e4m3fnuz"for MFMA suffix generation, ensuring both FP8 variants use the samefp8_fp8suffix pattern.src/target/codegen_hip.cc (2)
945-946: LGTM! FP8 e5m2fnuz type mappings are consistent.The new mappings for
float8_e5m2fnuzx4andfloat8_e5m2fnuzx8correctly follow the same pattern as the existing e4m3fnuz mappings (lines 943-944), ensuring uniform FP8 type handling in MFMA operations.
985-994: LGTM! Warp reduction operations are correctly integrated.The new warp reduction operations properly delegate to the corresponding
tl::warp_reduce_*functions defined insrc/tl_templates/hip/reduce.h(lines 273-291), following the established pattern for HIP builtin operations.src/tl_templates/hip/common.h (4)
3-5: LGTM! Required headers added for HIP warp and atomic operations.The new includes for
atomic.handhip/amd_detail/amd_warp_functions.hprovide the necessary infrastructure for the warp shuffle and atomic operations introduced in thetlnamespace.
110-130: LGTM! Any and All helper functions are correctly implemented.The
tl::Anyandtl::Allhelper functions provide simple, clear array predicates that will be useful for reduction and warp-level logic.
134-148: LGTM! Generic shfl wrappers correctly delegate to HIP intrinsics.The generic
tl::shfl_xor,tl::shfl_down,tl::shfl_up, andtl::shflwrappers properly delegate to the corresponding HIP intrinsics. The TODO comment appropriately notes that HIP/ROCm doesn't yet provideshfl_syncvariants (available in ROCm 7.1.1+).
151-198: LGTM! FP16/BF16 shfl specializations correctly handle type conversions.The specializations for
half_t(lines 151-173) andbfloat16_t(lines 176-198) correctly convert tofloatbefore shuffling and convert back afterward. This matches the pattern used in the CUDA version and ensures correct behavior for these types that don't have native HIP shuffle support.src/tl_templates/hip/reduce.h (5)
76-76: LGTM! Correctly uses type-safe tl::shfl_down wrapper.The change from
__shfl_downtotl::shfl_downensures correct behavior forhalf_tandbfloat16_ttypes by using the specializations defined incommon.h.
107-107: LGTM! Correctly uses type-safe tl::shfl_xor wrapper.The change from
__shfl_xortotl::shfl_xorensures correct behavior forhalf_tandbfloat16_ttypes by using the specializations defined incommon.h.
117-182: LGTM! CumSum1D implementation correctly uses tl::shfl wrappers.The
CumSum1Dimplementation properly usestl::shfl_down(line 142),tl::shfl_up(line 165), andtl::shfl(lines 152, 155, 175, 178) throughout, ensuring type safety for all supported types includinghalf_tandbfloat16_t.
184-260: LGTM! CumSum2D implementation correctly uses tl::shfl wrappers.The
CumSum2Dimplementation properly usestl::shfl_down(line 216),tl::shfl_up(line 242), andtl::shfl(lines 226, 229, 252, 255) throughout, ensuring type safety for all supported types includinghalf_tandbfloat16_t.
273-291: LGTM! Warp reduction wrappers are correctly implemented.The
warp_reduce_sum,warp_reduce_max,warp_reduce_min,warp_reduce_bitand, andwarp_reduce_bitorfunctions properly delegate towarp_reducewith the appropriate operation. Once thewarp_reducefunction is fixed to usetl::shfl_xor, these wrappers will work correctly for all types.
```python
A = torch.randn(M, N, dtype=torch.float32).cuda().to(getattr(torch, dtype))
B = torch.zeros(M, N, dtype=torch.float32).cuda().to(getattr(torch, dtype))
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check tensor initialization patterns in the test file
echo "=== Tensor initialization patterns in test_tilelang_language_atomic_add.py ==="
rg -n -B1 -A2 "torch\.(randn|zeros|arange)" testing/python/language/test_tilelang_language_atomic_add.py | head -80
Repository: tile-ai/tilelang
Length of output: 2400
🏁 Script executed:
#!/bin/bash
# Identify which functions contain the anomalous initialization patterns
echo "=== Function containing lines 214-215 (two-step .to() conversion) ==="
sed -n '200,220p' testing/python/language/test_tilelang_language_atomic_add.py
echo -e "\n=== Function containing lines 309-310 (float32 only, no conversion) ==="
sed -n '300,320p' testing/python/language/test_tilelang_language_atomic_add.py
Repository: tile-ai/tilelang
Length of output: 1627
Align tensor initialization with standard pattern used elsewhere in the file.
The tensor initialization in lines 214-215 uses a two-step approach (float32 → conversion), but this is inconsistent with the standard pattern throughout the file. The run_atomic_addx2 function accepts a dtype parameter yet doesn't use it directly. All other test functions with dtype parameters (e.g., run_atomic_add, run_atomic_add_builtin) create tensors directly with the target dtype: torch.randn(..., dtype=getattr(torch, dtype)).cuda(). Update lines 214-215 to follow this consistent approach, or add a comment explaining why the two-step conversion is necessary.
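Concretely, the difference between the two initialization styles looks like this (standalone illustration; `M`, `N`, and `dtype` stand in for the test's parameters):
```python
import torch

M, N, dtype = 32, 64, "float16"

# Two-step style used at lines 214-215: allocate float32, then convert.
A_indirect = torch.randn(M, N, dtype=torch.float32).cuda().to(getattr(torch, dtype))

# Style used by the other tests in the file: allocate in the target dtype directly.
A_direct = torch.randn(M, N, dtype=getattr(torch, dtype)).cuda()
```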
🤖 Prompt for AI Agents
In testing/python/language/test_tilelang_language_atomic_add.py around lines
214-215, the tensors A and B are created as float32 then converted to the target
dtype, which is inconsistent with the rest of the file; change the
initialization to construct the tensors directly with the target dtype (e.g.,
torch.randn(M, N, dtype=getattr(torch, dtype)).cuda() and torch.zeros(M, N,
dtype=getattr(torch, dtype)).cuda()), or if there is a deliberate reason to
create as float32 first, add a brief comment explaining why the two-step
conversion is necessary.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
testing/python/language/test_tilelang_language_annot.py (1)
25-27: Tests hardcode the "cuda" device despite the HIP workaround. The workaround at lines 8-10 is specifically for HIP/ROCm, but the test hardcodes `device="cuda"`. This creates two problems:
- The test will fail on ROCm-only systems where the "cuda" device doesn't exist
- The HIP-specific workaround won't actually be validated on HIP hardware
Since this PR aims to "Open Rocm ci test," consider parameterizing the device or using a helper that selects the appropriate device (cuda/hip) at runtime.
🔎 Example fix using device detection
```diff
+ # Detect available device
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+
  ker = example_tensor_annot()
- A = torch.arange(16, dtype=torch.int32, device="cuda")
+ A = torch.arange(16, dtype=torch.int32, device=device)
  ker(A)
- expected = torch.zeros(16, dtype=torch.int32, device="cuda")
+ expected = torch.zeros(16, dtype=torch.int32, device=device)
```
Note: PyTorch on ROCm typically exposes devices through the `torch.cuda` API, so this should work. Verify that the test framework or tilelang provides a device selector if needed.
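One way to express "a helper that selects the appropriate device" is a small pytest fixture. A sketch under the assumption that ROCm builds of PyTorch report the GPU through `torch.cuda` (the fixture name is made up for illustration):
```python
import pytest
import torch

@pytest.fixture
def gpu_device():
    # Both CUDA and ROCm builds of PyTorch expose the GPU via torch.cuda.
    if torch.cuda.is_available():
        return "cuda"
    pytest.skip("No CUDA/ROCm GPU available")
```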
♻️ Duplicate comments (11)
testing/python/language/test_tilelang_language_annot.py (2)
32-34: Same issues as test_tensor_annot_mul.This test has the same problems identified in the first test:
- Missing issue reference in the comment (lines 32-33)
- Hardcoded "cuda" device (lines 49, 51) despite the HIP-specific workaround
Please apply the same fixes as suggested for `test_tensor_annot_mul`.
Also applies to: 49-51
56-58: Same issues as previous tests.This test has the same problems:
- Missing issue reference in the comment (lines 56-57)
- Hardcoded "cuda" device (lines 73, 75) despite the HIP-specific workaround
Please apply the same fixes as suggested for the other test functions.
Also applies to: 73-75
testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py (3)
133-156: Skip reason lacks issue tracking.The skip decorators use generic reasons without referencing tracking issues for the known GEMM SS problems. This was already flagged in the previous review.
290-313: Skip reason lacks issue tracking.The skip decorators use generic reasons without referencing tracking issues. This was already flagged in the previous review.
612-634: Document the intentional FP8 test gating strategy and create tracking for precision issues.The FP8 test gating is intentional: SS and RS variants are explicitly skipped with reasons ("Temporarily disabling until GEMM SS/RS issues are resolved"), while SR and RR variants are enabled. This pattern should be documented in a test strategy comment or PR description to clarify that the inconsistency is by design.
Additionally, the TODO comments referencing `float8_e5m2fnuz` precision problems (in both SR and RR ROCm tests) lack issue tracking. Create GitHub issues or add issue numbers to these comments for better visibility and tracking.
testing/python/cache/test_tilelang_kernel_cache.py (2)
199-201: Remove CUDA gating from compilation-only test. This test only compiles kernels without executing them (lines 222-227, 240-245). Given the PR objective to "Open Rocm ci test," this decorator incorrectly skips the test on ROCm CI. Remove `@tilelang.testing.requires_cuda` to allow cache behavior testing across all platforms.
Based on previous review: compilation-only tests should run on all platforms to maximize CI coverage.
250-252: Remove CUDA gating from compilation-only test. This test only compiles a kernel (lines 274-279) without GPU execution. The `@tilelang.testing.requires_cuda` decorator unnecessarily restricts ROCm CI coverage. Remove it to align with the PR goal of opening ROCm CI tests.
Based on previous review: compilation-only tests should run on all platforms.
testing/python/language/test_tilelang_language_atomic_add.py (2)
214-215: Align tensor initialization with the standard pattern used elsewhere in the file. The tensor initialization uses a two-step approach (float32 → conversion), which is inconsistent with the standard pattern throughout the file. All other test functions create tensors directly with the target dtype: `torch.randn(..., dtype=getattr(torch, dtype)).cuda()`.
🔎 Proposed fix
```diff
- A = torch.randn(M, N, dtype=torch.float32).cuda().to(getattr(torch, dtype))
- B = torch.zeros(M, N, dtype=torch.float32).cuda().to(getattr(torch, dtype))
+ A = torch.randn(M, N, dtype=getattr(torch, dtype)).cuda()
+ B = torch.zeros(M, N, dtype=getattr(torch, dtype)).cuda()
```
253-254: Add the missing `@requires_cuda` decorator for consistency. `test_atomic_addx2_float` is missing the `@tilelang.testing.requires_cuda` decorator, while `test_atomic_addx2_half` (Line 248) has it. Both tests use the same underlying `atomic_addx2` operation, which is CUDA-specific. This inconsistency will cause `test_atomic_addx2_float` to fail on non-CUDA platforms.
🔎 Proposed fix
```diff
+@tilelang.testing.requires_cuda
 def test_atomic_addx2_float():
     run_atomic_addx2(32, 64, 8, 16, dtype=T.float32)
```
src/tl_templates/hip/reduce.h (1)
262-271: Use `tl::shfl_xor` instead of raw `__shfl_xor` for type safety. The `warp_reduce` function uses raw HIP intrinsics (`__shfl_xor`) which will produce incorrect results or fail for `half_t` and `bfloat16_t` types. These types require the `tl::shfl_xor` wrapper (defined in `common.h` lines 148-177) that converts to float before shuffling.
This is inconsistent with the rest of the file, where the `tl::shfl_*` wrappers are correctly used (e.g., lines 76, 107, 142, 165, etc.).
🔎 Proposed fix
```diff
 template <typename T, typename ReduceOp>
 TL_DEVICE T warp_reduce(T value, ReduceOp op) {
-  value = op(value, __shfl_xor(value, 32));
-  value = op(value, __shfl_xor(value, 16));
-  value = op(value, __shfl_xor(value, 8));
-  value = op(value, __shfl_xor(value, 4));
-  value = op(value, __shfl_xor(value, 2));
-  value = op(value, __shfl_xor(value, 1));
+  value = op(value, tl::shfl_xor(value, 32));
+  value = op(value, tl::shfl_xor(value, 16));
+  value = op(value, tl::shfl_xor(value, 8));
+  value = op(value, tl::shfl_xor(value, 4));
+  value = op(value, tl::shfl_xor(value, 2));
+  value = op(value, tl::shfl_xor(value, 1));
   return value;
 }
```
src/tl_templates/hip/debug.h (1)
40-49: Missing `unsigned long long` specialization. The trait defines `long long` but not `unsigned long long`. This could cause the fallback template to be used for `uint64_t` values, printing them as pointers instead of their actual values.
🔎 Proposed fix
```diff
 DEFINE_PRINT_TRAIT(long long, "long long", "%lld", long long);
+DEFINE_PRINT_TRAIT(unsigned long long, "unsigned long long", "%llu", unsigned long long);
```
🧹 Nitpick comments (5)
testing/python/language/test_tilelang_language_ptr.py (1)
61-62: Consider adding a CUDA availability check. The test creates CUDA tensors but lacks a CUDA availability decorator (e.g., `@tilelang.testing.requires_cuda`). The AI summary indicates other test files in this PR are receiving similar gating decorators for backend availability. Adding this decorator would prevent test failures when CUDA is unavailable.
🔎 Suggested decorator
```diff
+@tilelang.testing.requires_cuda
 def test_matmul():
     run_matmul(1024, 1024, 1024, 128, 128, 32)
```
testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py (2)
432-441: Document precision problems with issue tracking.The TODO comments about ROCm precision problems should be tracked in dedicated issues. Commented-out tests indicate known instabilities that need formal tracking for resolution and re-enablement.
Consider creating issues for:
- ROCm int8 precision problem (lines 432-434)
- ROCm float precision problem (lines 439-441)
595-597: Document ROCm precision problems with issue tracking. The TODO comment about ROCm precision problems with `num_stages=2` should be tracked in a dedicated issue for proper resolution planning.
tilelang/engine/lower.py (2)
73-74: Consider validating include paths before use. The direct concatenation of include paths could produce invalid compiler flags if the constants are `None` or empty strings. Consider adding validation to ensure paths are valid before passing them to the compiler.
🔎 Suggested validation approach
```diff
+    # Validate required include paths
+    if not TILELANG_TEMPLATE_PATH or not CUTLASS_INCLUDE_DIR:
+        raise ValueError("Required CUDA include paths not configured. Ensure TILELANG_TEMPLATE_PATH and CUTLASS_INCLUDE_DIR are set.")
+
     options = [
         "-std=c++17",
         "-I" + TILELANG_TEMPLATE_PATH,
         "-I" + CUTLASS_INCLUDE_DIR,
     ]
```
120-121: Consider validating include paths before use.Similar to the CUDA path, validate that HIP include paths are properly configured before passing them to the compiler.
🔎 Suggested validation approach
```diff
+    # Validate required include paths
+    if not TILELANG_TEMPLATE_PATH or not COMPOSABLE_KERNEL_INCLUDE_DIR:
+        raise ValueError("Required HIP include paths not configured. Ensure TILELANG_TEMPLATE_PATH and COMPOSABLE_KERNEL_INCLUDE_DIR are set.")
+
     hsaco = hipcc.compile_hip(
         code,
         target_format="hsaco",
         options=[
             "-std=c++17",
             "-I" + TILELANG_TEMPLATE_PATH,
             "-I" + COMPOSABLE_KERNEL_INCLUDE_DIR,
         ],
```
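The same check can be expressed once as a generic helper, independent of the exact constants (a sketch; the function name is made up for illustration):
```python
import os

def require_include_dirs(*paths):
    """Raise early if any compiler include directory is unset or missing."""
    missing = [p for p in paths if not p or not os.path.isdir(p)]
    if missing:
        raise ValueError(f"Include paths not configured or not found: {missing}")
```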
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (54)
- .github/workflows/ci.yml
- src/op/gemm.cc
- src/op/logical.cc
- src/target/codegen_hip.cc
- src/tl_templates/cuda/reduce.h
- src/tl_templates/hip/atomic.h
- src/tl_templates/hip/common.h
- src/tl_templates/hip/debug.h
- src/tl_templates/hip/hip_fp8.h
- src/tl_templates/hip/reduce.h
- testing/python/autotune/test_tilelang_autotune.py
- testing/python/cache/test_tilelang_kernel_cache.py
- testing/python/carver/test_tilelang_carver_cuda_driver_properties.py
- testing/python/carver/test_tilelang_carver_recommend_hints.py
- testing/python/components/test_storage_rewrite_detect_inplace.py
- testing/python/debug/test_device_assert.py
- testing/python/debug/test_tilelang_debug_print.py
- testing/python/issue/test_tilelang_issue_1001.py
- testing/python/issue/test_tilelang_issue_1008.py
- testing/python/issue/test_tilelang_issue_830.py
- testing/python/issue/test_tilelang_issue_96.py
- testing/python/jit/test_tilelang_jit_callback.py
- testing/python/jit/test_tilelang_jit_cutedsl.py
- testing/python/jit/test_tilelang_jit_gemm.py
- testing/python/jit/test_tilelang_jit_gemm_cython.py
- testing/python/jit/test_tilelang_jit_nvrtc.py
- testing/python/jit/test_tilelang_jit_parcompile.py
- testing/python/jit/test_tilelang_jit_tvm_ffi.py
- testing/python/kernel/test_tilelang_kernel_gemm.py
- testing/python/kernel/test_tilelang_kernel_gemm_simt.py
- testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py
- testing/python/language/test_tilelang_language_alias.py
- testing/python/language/test_tilelang_language_alloc.py
- testing/python/language/test_tilelang_language_annot.py
- testing/python/language/test_tilelang_language_annotate_safe_value.py
- testing/python/language/test_tilelang_language_atomic_add.py
- testing/python/language/test_tilelang_language_clear.py
- testing/python/language/test_tilelang_language_composable_index.py
- testing/python/language/test_tilelang_language_copy.py
- testing/python/language/test_tilelang_language_frontend_v2.py
- testing/python/language/test_tilelang_language_infinity.py
- testing/python/language/test_tilelang_language_let.py
- testing/python/language/test_tilelang_language_mask_op.py
- testing/python/language/test_tilelang_language_ptr.py
- testing/python/language/test_tilelang_language_unroll.py
- testing/python/language/test_tilelang_language_var_init.py
- testing/python/language/test_tilelang_language_vectorize.py
- testing/python/language/test_tilelang_language_vectorized_cast.py
- testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py
- testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py
- tilelang/engine/lower.py
- tilelang/intrinsics/mfma_layout.py
- tilelang/intrinsics/mfma_macro_generator.py
- tilelang/jit/adapter/wrapper.py
💤 Files with no reviewable changes (3)
- testing/python/issue/test_tilelang_issue_830.py
- testing/python/language/test_tilelang_language_copy.py
- testing/python/language/test_tilelang_language_composable_index.py
✅ Files skipped from review due to trivial changes (1)
- src/tl_templates/hip/hip_fp8.h
🚧 Files skipped from review as they are similar to previous changes (21)
- testing/python/language/test_tilelang_language_frontend_v2.py
- tilelang/intrinsics/mfma_macro_generator.py
- testing/python/language/test_tilelang_language_clear.py
- testing/python/issue/test_tilelang_issue_1008.py
- testing/python/debug/test_device_assert.py
- testing/python/language/test_tilelang_language_infinity.py
- .github/workflows/ci.yml
- testing/python/jit/test_tilelang_jit_cutedsl.py
- testing/python/carver/test_tilelang_carver_recommend_hints.py
- testing/python/carver/test_tilelang_carver_cuda_driver_properties.py
- tilelang/intrinsics/mfma_layout.py
- src/op/gemm.cc
- testing/python/autotune/test_tilelang_autotune.py
- testing/python/components/test_storage_rewrite_detect_inplace.py
- testing/python/language/test_tilelang_language_annotate_safe_value.py
- testing/python/jit/test_tilelang_jit_nvrtc.py
- testing/python/issue/test_tilelang_issue_96.py
- testing/python/language/test_tilelang_language_vectorize.py
- testing/python/issue/test_tilelang_issue_1001.py
- testing/python/jit/test_tilelang_jit_parcompile.py
- testing/python/jit/test_tilelang_jit_callback.py
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.
Applied to files:
testing/python/language/test_tilelang_language_mask_op.py
testing/python/jit/test_tilelang_jit_gemm_cython.py
testing/python/kernel/test_tilelang_kernel_gemm.py
tilelang/engine/lower.py
testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py
testing/python/language/test_tilelang_language_alloc.py
testing/python/kernel/test_tilelang_kernel_gemm_simt.py
testing/python/language/test_tilelang_language_let.py
testing/python/language/test_tilelang_language_vectorized_cast.py
testing/python/jit/test_tilelang_jit_tvm_ffi.py
testing/python/cache/test_tilelang_kernel_cache.py
testing/python/language/test_tilelang_language_var_init.py
testing/python/language/test_tilelang_language_annot.py
testing/python/language/test_tilelang_language_unroll.py
testing/python/jit/test_tilelang_jit_gemm.py
testing/python/language/test_tilelang_language_alias.py
testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py
testing/python/language/test_tilelang_language_ptr.py
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.
Applied to files:
testing/python/language/test_tilelang_language_mask_op.py
testing/python/kernel/test_tilelang_kernel_gemm.py
testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py
testing/python/language/test_tilelang_language_alloc.py
testing/python/kernel/test_tilelang_kernel_gemm_simt.py
testing/python/language/test_tilelang_language_unroll.py
testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py
testing/python/language/test_tilelang_language_alias.py
testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py
testing/python/language/test_tilelang_language_atomic_add.py
📚 Learning: 2025-12-24T17:20:32.819Z
Learnt from: clouds56
Repo: tile-ai/tilelang PR: 1527
File: tilelang/env.py:0-0
Timestamp: 2025-12-24T17:20:32.819Z
Learning: The nvidia-cuda-nvcc PyPI package installs to `nvidia/cu13/bin/` (for CUDA 13), `nvidia/cu12/bin/` (for CUDA 12), and `nvidia/cu11/bin/` (for CUDA 11) in the site-packages directory, not to `nvidia/cuda_nvcc/bin/`. These paths should be used when detecting CUDA installations from PyPI packages in tilelang/env.py.
Applied to files:
tilelang/engine/lower.py
📚 Learning: 2025-09-15T10:51:06.985Z
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h.
Applied to files:
src/tl_templates/hip/reduce.h
📚 Learning: 2025-09-15T10:51:06.985Z
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). However, the layout template parameter should use sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h where Crosswise uses sizeof(ElementA) but the layout template uses sizeof_bits<ElementA>::value.
Applied to files:
src/tl_templates/hip/reduce.h
🧬 Code graph analysis (13)
testing/python/language/test_tilelang_language_mask_op.py (2)
tilelang/jit/__init__.py (2)
compile(47-107)
compile(347-373)
tilelang/jit/kernel.py (1)
out_idx(609-610)
testing/python/jit/test_tilelang_jit_gemm_cython.py (2)
tilelang/language/v2/dtypes.py (2)
float32(300-300)
float16(299-299)
tilelang/language/symbolics.py (1)
dynamic(12-29)
testing/python/debug/test_tilelang_debug_print.py (1)
tilelang/jit/__init__.py (2)
compile(47-107)
compile(347-373)
src/tl_templates/cuda/reduce.h (1)
src/tl_templates/hip/reduce.h (2)
run(121-181)
run(188-259)
testing/python/kernel/test_tilelang_kernel_gemm_simt.py (1)
testing/python/kernel/test_tilelang_kernel_fp8_gemm_mma.py (1)
assert_tl_matmul_correctness(183-218)
testing/python/jit/test_tilelang_jit_tvm_ffi.py (2)
tilelang/language/v2/dtypes.py (2)
float32(300-300)
float16(299-299)
tilelang/language/symbolics.py (1)
dynamic(12-29)
src/tl_templates/hip/reduce.h (1)
src/tl_templates/hip/common.h (13)
tl(110-200)
shfl_down(138-140)
shfl_down(157-161)
shfl_down(182-186)
shfl_xor(134-136)
shfl_xor(151-155)
shfl_xor(176-180)
shfl(146-148)
shfl(169-173)
shfl(194-198)
shfl_up(142-144)
shfl_up(163-167)
shfl_up(188-192)
tilelang/jit/adapter/wrapper.py (1)
tilelang/layout/layout.py (1)
index(46-55)
src/tl_templates/hip/common.h (2)
src/tl_templates/cuda/common.h (5)
tl(248-342)
tl(571-573)
tl(574-576)
tl(591-671)
bfloat16_t(581-582)
src/tl_templates/cuda/atomic.h (1)
half_t(24-26)
src/tl_templates/hip/debug.h (5)
src/tl_templates/hip/gemm.h (3)
void(158-232)
void(234-291)
bfloat16_t(33-52)
tilelang/language/print_op.py (1)
print_var(15-25)
tilelang/layout/layout.py (1)
index(46-55)
src/tl_templates/cuda/cuda_fp4.h (1)
float(195-197)
src/tl_templates/cuda/reduce.h (2)
half_t(17-19)
bfloat16_t(20-22)
testing/python/jit/test_tilelang_jit_gemm.py (1)
tilelang/language/v2/dtypes.py (1)
float32(300-300)
testing/python/language/test_tilelang_language_alias.py (1)
tilelang/jit/__init__.py (2)
compile(47-107)
compile(347-373)
testing/python/language/test_tilelang_language_atomic_add.py (2)
tilelang/language/v2/dtypes.py (2)
float16(299-299)
float32(300-300)
tilelang/language/atomic.py (1)
atomic_addx2(230-265)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
- GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
- GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
🔇 Additional comments (71)
testing/python/jit/test_tilelang_jit_tvm_ffi.py (2)
167-167: LGTM: Accumulation dtype upgrade improves numerical stability. The consistent change from T.float16 to T.float32 for accumulation dtype across all matmul tests follows best practices for mixed-precision kernels. Using float32 accumulation with float16 inputs/outputs prevents precision loss and overflow during iterative accumulation operations, which is particularly important for matrix multiplications.
Also applies to: 210-210, 252-252, 301-301, 303-303, 306-306
386-386: LGTM: Appropriate CUDA-only gating for L2 persistent cache. The @tilelang.testing.requires_cuda decorator correctly gates this test since L2 persistent cache functionality relies on CUDA-specific APIs (__tvm_cuda_stream_set_access_policy_window_packed and __tvm_cuda_stream_reset_access_policy_window_packed). These features are not available on ROCm/HIP, making this gating necessary for cross-platform CI.
testing/python/kernel/test_tilelang_kernel_gemm_simt.py (3)
157-157: Correctness fix: Reference computation now uses out_dtype. This change correctly aligns the reference result with the actual kernel output type. Previously casting to accum_dtype was incorrect when out_dtype differs from accum_dtype. This fix is consistent with similar test files (test_tilelang_kernel_gemm_mma.py line 193, test_tilelang_kernel_fp8_gemm_mma.py line 213).
164-164: LGTM: Improved accumulation precision. Changing accum_dtype from T.float16 to T.float32 improves numerical accuracy during accumulation, which is especially important for matrix multiplication where many additions occur. This prevents potential precision loss while still producing a float16 output.
168-170: LGTM: CUDA-gated int8 test added. The new test validates int8 matrix multiplication with int32 accumulation. The @tilelang.testing.requires_cuda decorator appropriately gates this test to CUDA environments, which aligns with the PR objective to enable ROCm CI while selectively skipping features not yet supported on ROCm.
testing/python/jit/test_tilelang_jit_gemm.py (2)
106-106: LGTM: Function name updated to reflect accumulation dtype. The function rename from test_gemm_f16f16f16_nn_kernel_jit to test_gemm_f16f16f32_nn_kernel_jit correctly reflects the change in accumulation dtype to float32, improving code clarity.
115-115: LGTM: Float32 accumulation improves numerical precision. Changing the accumulation dtype from T.float16 to T.float32 addresses potential numerical stability issues in GEMM operations, which aligns with the PR's bug-fix objective.
169-169: LGTM: Consistent float32 accumulation across Cython tests. The systematic update of accumulation dtype from T.float16 to T.float32 across all test functions improves numerical stability and is consistent with the changes in test_tilelang_jit_gemm.py.
Also applies to: 211-211, 255-255, 304-308, 357-357
testing/python/debug/test_tilelang_debug_print.py (7)
2-2: LGTM: pytest import added for parametrization. The pytest import is necessary for the @pytest.mark.parametrize decorator used in the refactored test.
20-24: Excellent refactoring: parametrized test improves coverage. The parametrized test is well-structured and covers 12 dtypes (integer types, standard floating-point types, and bfloat16) in a single test function. This improves maintainability while ensuring comprehensive dtype coverage. Platform-specific FP8 dtypes are correctly tested separately with appropriate decorators.
15-15: Approved: target parameter removed to support multi-platform testing. Removing the explicit target="cuda" parameter allows this test to run on both CUDA and ROCm platforms by using the default target selection (via the TILELANG_TARGET environment variable or "auto"). This change aligns with the PR objective to enable ROCm CI testing.
Based on learnings, this addresses the previously noted target issues for this file.
50-50: Consistent with multi-platform enablement. The removal of target="cuda" is consistent with the other test functions and enables this conditional debug print test to run on both CUDA and ROCm platforms.
69-69: LGTM: consistent target parameter removal. Consistent with the multi-platform testing approach applied throughout this file.
88-88: LGTM: consistent target parameter removal. Consistent with the multi-platform testing approach.
107-107: LGTM: final target parameter removal completes the refactoring. This completes the consistent removal of explicit target="cuda" parameters throughout the file, enabling comprehensive multi-platform testing support.
testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py (1)
156-156: The decorator additions are justified, but the PR title may be misleading regarding scope. These sparse GEMM tests use NVIDIA-specific features (hardcoded arch="8.0" and sparse tensor cores), making the @tilelang.testing.requires_cuda decorator appropriate. Unlike the non-sparse GEMM tests in test_tilelang_tilelibrary_gemm.py, which have separate ROCm-compatible variants (test_gemm_ss_fp8_rocm, etc.), the sparse GEMM v2 tests have no ROCm counterparts and are genuinely CUDA-only.
However, the PR title "Open Rocm ci test and fix some bugs" may suggest broader ROCm CI enablement than is actually occurring with this change. Consider clarifying the PR title or description to reflect that these sparse tests remain architecture-specific to NVIDIA and are intentionally gated.
testing/python/language/test_tilelang_language_ptr.py (1)
44-54: No changes needed. The test correctly pairs execution_backend="cython" with CUDA tensors. The cython backend is explicitly designed to handle CUDA kernels and tensor pointers, as documented in the CythonKernelAdapter class and official documentation examples. This pattern is used throughout the test suite without issue.
Likely an incorrect or invalid review comment.
testing/python/language/test_tilelang_language_unroll.py (3)
20-20: Good explanatory comment. The TODO clearly explains the HIP limitation for the unroll_factor parameter and why the test is CUDA-gated. This distinction from basic unroll with step is helpful for future HIP support work.
21-33: Appropriate CUDA gating and consistent target usage. The @tilelang.testing.requires_cuda decorator correctly gates this test since HIP doesn't support the unroll_factor parameter (as noted in the TODO). The change to use the default target is consistent with the first test and safe here because the decorator ensures the test only runs on CUDA platforms where the default target will be CUDA.
6-17: No action needed. The code is correctly structured. The first test uses T.unroll(0, 16, step=4) without CUDA gating by design. The TODO comment on line 20 specifically states that "unroll factor" (the unroll_factor parameter) is not supported on HIP, not the basic unroll with step parameter. The second test correctly uses unroll_factor and is properly gated with @tilelang.testing.requires_cuda. The distinction between these two unroll modes is intentional and correctly reflected in the test structure.
testing/python/language/test_tilelang_language_alias.py (1)
48-48: Add the @tilelang.testing.requires_cuda decorator for consistency and robustness. Every other test in
testing/python/language/includes a GPU backend requirement decorator (requires_cuda, requires_hip, or similar). This test should be gated with@tilelang.testing.requires_cudato ensure it only runs when GPU support is available, preventing failures on systems without CUDA or ROCm.Alternatively, if this test is intentionally designed to be backend-agnostic and work without GPU support, clarify that design choice with a comment.
⛔ Skipped due to learnings
Learnt from: lucifer1004 Repo: tile-ai/tilelang PR: 1256 File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115 Timestamp: 2025-11-14T07:56:11.098Z Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.Learnt from: silentCoder-dev Repo: tile-ai/tilelang PR: 1464 File: testing/python/language/test_tilelang_language_rand.py:14-14 Timestamp: 2025-12-18T04:50:00.512Z Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py (4)
114-118: LGTM! Improved accumulation precision.Changing
dtypeAccumfromT.float16toT.float32improves numerical stability and reduces accumulation errors during GEMM operations. This is a solid correctness improvement.
271-275: LGTM! Improved accumulation precision.Consistent with the SS variant changes, this improves numerical stability for the RS GEMM variant.
427-431: LGTM! Improved accumulation precision.Consistent precision improvement for the SR GEMM variant.
590-593: LGTM! Improved accumulation precision.Consistent precision improvement for the RR GEMM variant.
testing/python/language/test_tilelang_language_mask_op.py (1)
31-31: LGTM! Backend auto-detection enables ROCm compatibility.Removing the hardcoded
target="cuda"parameter allows the compilation target to be auto-detected based on the runtime environment. Thedetermine_target()function checks for CUDA availability first, then ROCm/HIP, then Metal, enabling these tests to run on ROCm CI while maintaining backward compatibility with CUDA systems.Verify that these tests pass on ROCm CI with the auto-detected target.
testing/python/language/test_tilelang_language_vectorized_cast.py (1)
78-78: LGTM! Proper CUDA gating for CUDA-specific test.The decorator correctly gates this test to CUDA-only execution, which is appropriate since the test validates CUDA-specific vectorized cast intrinsics (e.g.,
__float22half2_rn,__nv_cvt_float2_to_fp8x2) and usesdevice="cuda". This aligns with the broader pattern across the PR to ensure tests only run in their target environments.testing/python/language/test_tilelang_language_let.py (1)
6-6: LGTM! Appropriate CUDA gating.The decorator correctly ensures this test only runs with CUDA available, which is necessary since the test compiles with
target="cuda"and validates CUDA-specific code generation.testing/python/language/test_tilelang_language_var_init.py (1)
6-7: LGTM! Clear documentation of platform limitation.The TODO comment helpfully explains why this test is CUDA-only (var init not supported on HIP), and the decorator correctly gates the test accordingly. This provides good context for future maintenance.
testing/python/kernel/test_tilelang_kernel_gemm.py (7)
106-106: LGTM! Appropriate CUDA gating for f16f16f16 accumulation.The decorator correctly restricts this f16 accumulation test to CUDA, consistent with the selective gating pattern applied across similar f16f16f16 variants in this file.
172-172: LGTM! Consistent CUDA gating.Appropriately gates the transposed-A variant to CUDA, consistent with other f16f16f16 tests.
190-190: LGTM! Consistent CUDA gating.Appropriately gates the transposed-B variant to CUDA, consistent with other f16f16f16 tests.
216-216: LGTM! Appropriate CUDA gating for f64 test.The decorator correctly restricts the float64 GEMM test to CUDA.
237-238: LGTM! Well-documented temporary ROCm disablement.The TODO clearly explains the precision issue on ROCm for this specific f32 transpose combination, and the CUDA gating appropriately restricts the test until the issue is resolved. This provides good documentation for tracking the outstanding work.
255-255: LGTM! Consistent padding test gating.Appropriately gates the aligned padding test to CUDA, consistent with other f16f16f16 tests.
273-273: LGTM! Consistent padding test gating.Appropriately gates the unaligned padding test to CUDA, consistent with other f16f16f16 tests.
testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py (2)
199-199: LGTM! Appropriate CUDA gating for INT4 tensor core test.The decorator correctly gates this test to CUDA-only execution, which is necessary since the test uses
INT4TensorCoreIntrinEmitterfor CUDA-specific tensor core operations and validates generated CUDA source code.
403-403: LGTM! Appropriate CUDA gating for weight-only transform test.The decorator correctly gates this test to CUDA-only execution, consistent with the INT4 tensor core operations that require CUDA.
testing/python/language/test_tilelang_language_alloc.py (4)
36-36: LGTM: Flexible assertion for codegen variations.The relaxed assertions correctly accommodate different code generation patterns for
T.alloc_var(), accepting both scalar (tmp =) and array-style (tmp[0] =) representations.Also applies to: 76-76
154-156: LGTM: More precise assertions with kernel_only=True.Using
kernel_only=Trueisolates the kernel source from wrapper code, and the count-based assertions (code.count("= 1;") == 1) verify that each initializer appears exactly once, making the test more robust.
159-162: Verify if ROCm truly lacks alloc_var initializer support.Same concern as lines 118-121: this compilation-only test is gated to CUDA despite the PR goal to "Open Rocm ci test." If the ROCm limitation is only at runtime, consider allowing this test to run on ROCm CI.
See verification script at lines 118-121.
118-121: Verify ROCm alloc_var initializer support with backend-specific codegen testing.The observation that this test only compiles and inspects source code—not GPU execution—is accurate. However, multiple tests across the codebase share this exact TODO, suggesting a genuine codegen limitation rather than overly conservative gating. Before relaxing the CUDA requirement, confirm whether:
- ROCm's
tilelang.compile()path actually produces incorrect or unsupported syntax foralloc_varinitializers- The underlying issue has since been fixed in ROCm backend codegen
If confirmed, update the TODO to reference specific ROCm backend ticket or constraint. If already resolved, remove the decorator and test on ROCm CI.
testing/python/cache/test_tilelang_kernel_cache.py (1)
121-123: LGTM: Correct CUDA gating for GPU execution.This test correctly requires CUDA because it executes kernels on GPU (lines 187-191:
.cuda()calls and kernel invocations with GPU tensors).testing/python/language/test_tilelang_language_atomic_add.py (3)
210-211: LGTM!The dtype parameter is properly propagated to
atomic_addx2_programand the default value is consistent.
238-238: LGTM!Adding
@tilelang.testing.requires_cudadecorators is appropriate since atomic operations are CUDA-specific. This ensures tests skip gracefully on non-CUDA platforms.Also applies to: 243-243
353-357: LGTM!Adding the
@tilelang.testing.requires_cudadecorator and testing multiple dtypes (float32, float16, bfloat16) improves both portability and coverage.src/op/logical.cc (2)
45-46: LGTM! HIP lowering support added correctly.The HIP lowering intrinsic registration for
tl.any_ofmirrors the CUDA pattern and correctly reuses the sameany_of_opimplementation.
53-54: LGTM! HIP lowering support added correctly.The HIP lowering intrinsic registration for
tl.all_ofmirrors the CUDA pattern and correctly reuses the sameall_of_opimplementation.src/tl_templates/hip/common.h (4)
3-5: LGTM! Required headers added for HIP atomics and warp functions.The includes for
atomic.handamd_warp_functions.hproperly support the newtlnamespace utilities.
110-130: LGTM! Any and All helpers correctly implemented.The
AnyandAllfunctions provide simple loop-based reductions that match the expected semantics for logical operations on arrays.
150-198: Verify half_t and bfloat16_t shuffle specializations.The shuffle specializations for
half_tandbfloat16_tcast to/fromfloatbefore shuffling. This approach should work correctly, but confirm that:
- The float conversion preserves the intended precision semantics
- The pattern matches the CUDA implementation in src/tl_templates/cuda/common.h
Based on the relevant code snippets, the CUDA implementation uses the same pattern with casts to/from float.
132-148: The current non-sync shuffle implementation appears to handle thread synchronization appropriately.The code uses compile-time branching (
if constexpr) to segregate intra-warp communication (via__shfl_*operations) from inter-warp communication (via__syncthreads()and shared memory). Within warp-level shuffles, unconditional execution ensures all participating lanes execute uniformly. The TODO comment about ROCm 7.1.1 providingshfl_syncremains valid as a future modernization opportunity, but the current approach is sound for the intended use cases.src/tl_templates/hip/atomic.h (1)
1-104: LGTM! HIP atomic helpers correctly implemented with documented limitation.The atomic operation helpers are well-structured and correctly delegate to HIP's native atomic functions. The comments clearly document that the
memory_orderparameter is unused and exists only for lowering compatibility. According to past review comments, the memory ordering limitation was addressed in commits 14067c3 to d592f8b.The vectorized atomic operations (
AtomicAddx2,AtomicAddx4, etc.) correctly handle per-element operations onfloat2andfloat4types.src/tl_templates/cuda/reduce.h (4)
185-186: LGTM! SEG parameterization improves flexibility.Changing from a fixed warp size of 32 to the template parameter
SEGallows the cumulative sum to work with different segment sizes, improving code reusability.
190-191: LGTM! Boundary check prevents out-of-bounds access.The early return when
gRow >= Hcorrectly handles the case when the number of rows doesn't evenly divide byTILE_H.
195-221: LGTM! Axis-driven reverse cumulative sum correctly implemented.The reverse path now computes
real_rowandreal_colbased on theAxistemplate parameter, allowing cumulative sums along either dimension. The logic correctly:
- Iterates segments in reverse order
- Applies per-segment reductions with shuffle operations
- Propagates carry values across segments
- Performs proper boundary checks before writing results
223-247: LGTM! Axis-driven forward cumulative sum correctly implemented.The forward path mirrors the reverse implementation with
Axis-based indexing, correctly:
- Iterates segments in forward order
- Applies per-segment reductions with shuffle operations
- Propagates carry values across segments
- Performs proper boundary checks before writing results
src/target/codegen_hip.cc (1)
985-994: Alltl::warp_reduce_*functions are properly implemented insrc/tl_templates/hip/reduce.h(lines 273–291) with correct HIP device syntax and expected warp-level reduction semantics.src/tl_templates/hip/reduce.h (4)
76-76: LGTM! Correct use of type-safe shuffle wrappers.The changes to use
tl::shfl_down(line 76) andtl::shfl_xor(line 107) ensure proper handling ofhalf_tandbfloat16_ttypes by leveraging the specializations incommon.h.Also applies to: 107-107
117-182: LGTM! CumSum1D implementation is correct.The 1D cumulative sum implementation correctly:
- Uses type-safe
tl::shfl_*wrappers throughout- Handles both forward and reverse modes with proper carry propagation across segments
- Includes appropriate boundary checks for input size
184-260: LGTM! CumSum2D implementation is correct.The 2D cumulative sum implementation correctly:
- Uses type-safe
tl::shfl_*wrappers throughout- Properly handles axis selection with appropriate row/column index mapping
- Implements both forward and reverse modes with correct carry propagation
- Includes proper boundary checks for both dimensions
273-291: LGTM! Wrapper functions are correctly implemented.The
warp_reduce_*wrapper functions are properly structured and use the appropriate reduction operators. However, they depend on thewarp_reducefunction (lines 262-271), which needs to be fixed to usetl::shfl_xorinstead of raw__shfl_xorintrinsics.src/tl_templates/hip/debug.h (5)
6-21: LGTM! Well-designed fallback template.The primary template provides a sensible fallback for unknown types by printing their address. This ensures the system handles any type gracefully.
23-38: LGTM! Clean macro design.The macro effectively reduces code duplication by generating specialized PrintTraits implementations with consistent structure.
60-90: LGTM! Well-implemented explicit specializations.The
boolspecialization correctly prints "true"/"false" strings, and the pointer specialization appropriately handles pointer types by printing their address.
92-101: LGTM! Clean delegation to trait system.The public API functions properly delegate to the trait-based implementation while maintaining backward-compatible signatures.
53-57: Review focus correction: Verify fp8 type conversions specifically.The macro uses C-style casts to float. This works for
half_t(which is_Float16) andbfloat16_t(which iship_bfloat16)—both are standard HIP types with implicit float conversion support.However,
fp8_e4_tandfp8_e5_tare typealiases to raw HIP FP8 types (__hip_fp8_e4m3_fnuzand__hip_fp8_e5m2_fnuz). These do not have operator float() definitions in the codebase. Verify whether the HIP runtime provides implicit float conversion for these raw FP8 types; if not, explicit conversion functions may be needed.tilelang/jit/adapter/wrapper.py (4)
314-316: Good refactoring for maintainability.Extracting the declaration parsing logic into a dedicated method improves code organization and enables subclass-specific overrides (as demonstrated by the HIP implementation).
592-592: LGTM!The
uint64touint64_ttype mapping is correct and consistent with other integer type mappings in the class.
585-585: Mapping is correct. The type fp8_e5_t is properly defined in HIP templates (src/tl_templates/hip/hip_fp8.h line 10) as an alias for __hip_fp8_e5m2_fnuz, confirming the mapping of "float8_e5m2fnuz" to "fp8_e5_t" is sound.
612-615: The HIP-specific parsing correctly uses split("{")[0] to extract the kernel signature. This is the appropriate approach for HIP kernels, which transition directly to the function body rather than using semicolon-terminated declarations like CUDA. No robustness concerns exist; the implementation is correct as-is.
| {"float8_e5m2fnuzx4", "fp8_e5_4_t"}, | ||
| {"float8_e5m2fnuzx8", "long"}, |
🧩 Analysis chain
Verification performed: inspected the dtype map in src/target/codegen_hip.cc (around lines 920-960), searched the repository for the fp8_e4_4_t/fp8_e5_4_t and fp8_e4_8_t/fp8_e5_8_t definitions, read src/tl_templates/hip/hip_fp8.h in full, checked for other uses of the float8_*fnuzx8 mappings, and consulted ROCm documentation on the HIP FP8 (E5M2/FNUZ) typedefs, hipDataType entries, and MFMA support on CDNA3/MI300-class hardware.
Fix FP8 x8 dtype mappings to use proper struct types.
The x8 FP8 dtype mappings use long instead of the semantically correct struct types defined in hip_fp8.h. The correct mappings should be:
- float8_e4m3fnuzx8 → fp8_e4_8_t (currently maps to long)
- float8_e5m2fnuzx8 → fp8_e5_8_t (currently maps to long)
Both fp8_e4_8_t and fp8_e5_8_t are fully defined in src/tl_templates/hip/hip_fp8.h with proper member access and alignment. Using long is inconsistent with how x4 variants are correctly mapped to their struct types (fp8_e4_4_t and fp8_e5_4_t) and loses type safety for MFMA operations.
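For concreteness, here is a minimal sketch of the corrected x8 entries as a self-contained table. The container name kFp8DtypeMap and the std::unordered_map layout are illustrative assumptions rather than the actual structure in codegen_hip.cc; only the two x8 string mappings come from the fix described above.

#include <string>
#include <unordered_map>

// Illustrative stand-in for the HIP codegen dtype table; the real table in
// codegen_hip.cc may use a different name and container. The two x8 entries
// now point at the struct types from src/tl_templates/hip/hip_fp8.h.
static const std::unordered_map<std::string, std::string> kFp8DtypeMap = {
    {"float8_e4m3fnuzx4", "fp8_e4_4_t"},
    {"float8_e5m2fnuzx4", "fp8_e5_4_t"},
    {"float8_e4m3fnuzx8", "fp8_e4_8_t"}, // previously "long"
    {"float8_e5m2fnuzx8", "fp8_e5_8_t"}, // previously "long"
};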
🤖 Prompt for AI Agents
In src/target/codegen_hip.cc around lines 945-946 the FP8 x8 dtype mappings
incorrectly use "long"; replace the two mappings so "float8_e4m3fnuzx8" maps to
"fp8_e4_8_t" and "float8_e5m2fnuzx8" maps to "fp8_e5_8_t" and ensure the file
includes the hip_fp8.h header (or the header that defines fp8_e4_8_t and
fp8_e5_8_t) so the struct types are available for MFMA/type-safe operations.
static TL_DEVICE void run(const T *__restrict__ src, T *__restrict__ dst,
                          int H, int W) {
🧩 Analysis chain
Verification performed: confirmed the current CumSum1D::run and CumSum2D::run signatures in src/tl_templates/cuda/reduce.h (both now return void), searched for CumSum2D call sites, inspected the code generation in src/op/reduce.cc around lines 520-575 where the extern call is emitted, and compared against void-returning call_extern emissions elsewhere in src/op/ that use DataType::Handle().
Fix code generator to match void return type of CumSum functions.
Both CumSum1D::run and CumSum2D::run now return void, writing results directly to dst. However, the code generator in src/op/reduce.cc line 570 incorrectly declares the return type as dst->dtype. Update the call to use DataType::Handle() instead, matching the pattern used elsewhere for void-returning call_extern invocations (see src/op/gemm.cc and src/op/copy.cc).
Current code generation (reduce.cc:570)
return Evaluate(Call(dst->dtype, builtin::call_extern(), args));
🤖 Prompt for AI Agents
In src/op/reduce.cc around line 570, the code generator incorrectly emits a
return type of dst->dtype for CumSum call_extern invocations even though
CumSum1D::run and CumSum2D::run are void and write directly into dst; change the
emitted return type to DataType::Handle() (the void/handle pattern used in
src/op/gemm.cc and src/op/copy.cc) so the generated call uses DataType::Handle()
for call_extern and the Evaluate(...) call matches the void-returning extern
function.
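A minimal sketch of the suggested change at reduce.cc:570, assuming the surrounding Evaluate(...) statement stays exactly as quoted above; only the return dtype passed to Call changes.

-return Evaluate(Call(dst->dtype, builtin::call_extern(), args));
+return Evaluate(Call(DataType::Handle(), builtin::call_extern(), args));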
# There is a known issue where the cython execution backend fails to build with T.symbolic.
# Forcing the TVM FFI execution backend to avoid the issue on HIP.
Add issue reference for better tracking.
The comment explains the workaround but doesn't include an issue reference. Based on past review feedback, you mentioned opening an issue for this cython backend bug. Please add the issue number here so maintainers know when it's safe to remove this workaround.
🔎 Suggested improvement
-# There is a known issue where the cython execution backend fails to build with T.symbolic.
-# Forcing the TVM FFI execution backend to avoid the issue on HIP.
+# There is a known issue where the cython execution backend fails to build with T.symbolic (see issue #XXXX).
+# Forcing the TVM FFI execution backend to avoid the issue on HIP.
🤖 Prompt for AI Agents
In testing/python/language/test_tilelang_language_annot.py around lines 8-9, the
workaround comment for forcing the TVM FFI execution backend lacks an issue
reference; update the comment to include the tracking issue number or URL (e.g.,
"See issue #<NNNN>" or full GitHub issue link). If the issue hasn't been opened
yet, create one describing the cython backend/T.symbolic build failure on HIP
and then insert that new issue number/link in the comment so maintainers know
when to remove the workaround.
@tilelang.testing.requires_cuda
@pytest.mark.parametrize(
    "M, N, K, trans_A, trans_B, in_dtype, out_dtype, dtypeAccum, block_M, block_N, block_K, num_stages, num_threads",
    [
        (128, 128, 128, True, True, T.float8_e5m2, T.float8_e5m2, T.float32, 128, 128, 32, 2, 128),
    ],
)
def test_gemm_sr_fp8_cuda(M, N, K, trans_A, trans_B, in_dtype, out_dtype, dtypeAccum, block_M, block_N, block_K, num_stages, num_threads):
    run_gemm_sr(M, N, K, trans_A, trans_B, in_dtype, out_dtype, dtypeAccum, block_M, block_N, block_K, num_stages, num_threads)


@tilelang.testing.requires_rocm
@pytest.mark.parametrize(
    "M, N, K, trans_A, trans_B, in_dtype, out_dtype, dtypeAccum, block_M, block_N, block_K, num_stages, num_threads",
    [
        # TODO: There is precision problem needs to repair
        # (128, 128, 128, True, True, T.float8_e5m2fnuz, T.float8_e5m2fnuz, T.float32, 128, 128, 32, 2, 128),
        (128, 128, 128, True, True, T.float8_e4m3fnuz, T.float8_e4m3fnuz, T.float32, 128, 128, 32, 2, 128),
    ],
)
def test_gemm_sr_fp8_rocm(M, N, K, trans_A, trans_B, in_dtype, out_dtype, dtypeAccum, block_M, block_N, block_K, num_stages, num_threads):
    run_gemm_sr(M, N, K, trans_A, trans_B, in_dtype, out_dtype, dtypeAccum, block_M, block_N, block_K, num_stages, num_threads)
🧩 Analysis chain
Verification performed: enumerated the FP8 GEMM test definitions in testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py and checked which variants carry skip markers, confirming that the SS and RS FP8 tests are skipped while the SR and RR variants are not.
SR and RR FP8 tests are enabled while SS and RS remain skipped—document the rationale.
The test gating is intentionally differentiated by GEMM variant: SS and RS variants are skipped with explicit reasons ("GEMM SS/RS issues are resolved"), while SR and RR variants run without skip markers. This pattern suggests variants have different maturity levels, but the decision rationale is not documented in the code.
Additionally, both SR and RR ROCm tests contain identical untracked TODO comments about float8_e5m2fnuz precision problems (lines 463-464, 627-629). Create tracking issues for these precision problems and add inline documentation explaining why SR/RR are enabled while SS/RS are disabled.
🤖 Prompt for AI Agents
In testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py around lines 448
to 470, document why SR and RR GEMM FP8 tests are enabled while SS and RS are
skipped, and replace the untracked TODOs about float8_e5m2 precision with links
to tracking issues: create one or two issues in the repo for the float8_e5m2fnuz
precision problem (and any related float8 variants) and update the two ROCm TODO
comments (and the earlier SS/RS skip comments) to reference those issue
numbers/URLs and a short rationale sentence stating why SR/RR are considered
stable enough to run while SS/RS remain skipped pending fixes; ensure both ROCm
occurrences reference the same tracking issue so future readers can find the bug
report.
from tvm.ir import CallingConv
from tvm.target import Target
from tilelang.contrib import hipcc, nvcc
from tilelang.env import COMPOSABLE_KERNEL_INCLUDE_DIR, CUTLASS_INCLUDE_DIR, TILELANG_TEMPLATE_PATH
🧩 Analysis chain
Verification performed: confirmed that TILELANG_TEMPLATE_PATH, CUTLASS_INCLUDE_DIR, and COMPOSABLE_KERNEL_INCLUDE_DIR are defined in tilelang/env.py through its EnvVar-based path-discovery logic (and can be None when discovery fails), and inspected how tilelang/engine/lower.py concatenates these constants into -I include flags.
Add validation to ensure required paths are available before use.
The constants are properly defined in tilelang.env and initialized with path discovery logic. However, when paths are not found, the constants default to None, which will cause a TypeError when lower.py attempts string concatenation with -I at lines 73–74 and 120–121. Add a runtime check to either:
- Validate that these constants are non-None before using them
- Raise an informative error at import time if required paths are unavailable
- Provide fallback defaults if installation layouts vary
🤖 Prompt for AI Agents
In tilelang/engine/lower.py around line 13 (and referencing uses at lines 73–74
and 120–121), the imported path constants from tilelang.env may be None and are
later concatenated with "-I", causing a TypeError; add a runtime validation at
module import (or before the first use) that COMPOSABLE_KERNEL_INCLUDE_DIR,
CUTLASS_INCLUDE_DIR and TILELANG_TEMPLATE_PATH are not None, and if any are
missing raise a clear ImportError/RuntimeError explaining which path is missing
and how to provide it (e.g., install prerequisites or set environment
variables), or alternatively supply documented fallback defaults if appropriate
for your project layout. Ensure the check reports the specific constant name and
its expected purpose so the error is actionable.
Actionable comments posted: 8
♻️ Duplicate comments (15)
.github/workflows/ci.yml (1)
376-376: Grammar issue already flagged in previous review. The grammatical error in this comment ("needs to repair" → "need to be repaired") was already identified in a previous review.
testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py (6)
133-156: Add tracking issue references to skip decorators.The FP8 tests for the SS GEMM variant are skipped with a generic reason. Create tracking issues for the "GEMM SS issues" referenced in the skip decorator, then update the decorator to include the issue number for traceability (e.g.,
reason="Temporarily disabling until GEMM SS issues are resolved. See issue #XXXX").Based on past review comments, this issue was previously flagged but not addressed.
290-313: Add tracking issue references to skip decorators.The FP8 tests for the RS GEMM variant are skipped with a generic reason. Create tracking issues for the "GEMM RS issues" and update the skip decorator to include the issue number for traceability.
Based on past review comments, this issue was previously flagged but not addressed.
432-441: Replace untracked TODO comments with tracking issue references.The TODO comments about precision problems on ROCm (lines 432-434 for int8, lines 439-441 for float) lack tracking issue references. Create tracking issues for these precision problems and replace the TODO comments with issue references for traceability.
Based on past review comments, untracked TODOs were previously flagged but not addressed.
448-470: Document the rationale for enabling SR FP8 tests while SS/RS remain disabled.The SR GEMM variant FP8 tests are enabled (no skip decorator) while SS and RS variants remain skipped. This suggests different maturity levels, but the decision rationale is not documented. Additionally, the TODO comment about
float8_e5m2fnuzprecision problems (lines 463-464) lacks a tracking issue reference.Create tracking issues for the precision problems and add inline documentation explaining why SR/RR variants are considered stable enough to run while SS/RS remain skipped pending fixes.
Based on past review comments, this issue was previously flagged but not addressed.
595-597: Replace untracked TODO comments with tracking issue references.The TODO comments about precision problems on ROCm (lines 595-597 and 627-629) both reference
float8_e5m2fnuzprecision issues but lack tracking issue references. Create a tracking issue for this precision problem and replace the TODO comments with issue references for traceability.Based on past review comments, untracked TODOs were previously flagged but not addressed.
Also applies to: 627-629
612-634: Document the rationale for enabling RR FP8 tests while SS/RS remain disabled.The RR GEMM variant FP8 tests are enabled (no skip decorator) while SS and RS variants remain skipped. Add inline documentation explaining why RR (like SR) is considered stable enough to run while SS/RS remain skipped pending fixes.
Based on past review comments, this issue was previously flagged but not addressed.
testing/python/cache/test_tilelang_kernel_cache.py (2)
199-199: Remove the CUDA decorator from this compilation-only test.This test only calls
tilelang.compile()without any GPU execution. The@tilelang.testing.requires_cudadecorator is inappropriate and will skip this test on ROCm CI, contradicting the PR objective to "Open Rocm ci test."Based on past review comments, this issue was previously flagged but not addressed.
250-250: Remove the CUDA decorator from this compilation-only test.This test only calls
tilelang.compile()without any GPU execution. The@tilelang.testing.requires_cudadecorator is inappropriate and will skip this test on ROCm CI, contradicting the PR objective to "Open Rocm ci test."Based on past review comments, this issue was previously flagged but not addressed.
src/tl_templates/hip/debug.h (1)
40-49: Missing unsigned long long specialization.The past review comment about the missing
unsigned long longspecialization has not been addressed. Line 49 defineslong longbut withoutunsigned long long,uint64_tvalues may fall back to the pointer-printing trait instead of printing their actual numeric values.🔎 Proposed fix
DEFINE_PRINT_TRAIT(long long, "long long", "%lld", long long); +DEFINE_PRINT_TRAIT(unsigned long long, "unsigned long long", "%llu", unsigned long long);src/tl_templates/hip/reduce.h (1)
262-271: Usetl::shfl_xorfor type safety with half/bfloat16 types.This function uses raw
__shfl_xorintrinsics which will produce incorrect results forhalf_tandbfloat16_ttypes. These types require thetl::shfl_xorwrapper that converts through float. This is inconsistent with the rest of this file wheretl::shfl_*wrappers are correctly used.🔎 Proposed fix
template <typename T, typename ReduceOp> TL_DEVICE T warp_reduce(T value, ReduceOp op) { - value = op(value, __shfl_xor(value, 32)); - value = op(value, __shfl_xor(value, 16)); - value = op(value, __shfl_xor(value, 8)); - value = op(value, __shfl_xor(value, 4)); - value = op(value, __shfl_xor(value, 2)); - value = op(value, __shfl_xor(value, 1)); + value = op(value, tl::shfl_xor(value, 32)); + value = op(value, tl::shfl_xor(value, 16)); + value = op(value, tl::shfl_xor(value, 8)); + value = op(value, tl::shfl_xor(value, 4)); + value = op(value, tl::shfl_xor(value, 2)); + value = op(value, tl::shfl_xor(value, 1)); return value; }testing/python/language/test_tilelang_language_atomic_add.py (1)
253-254: Add@requires_cudadecorator and fix dtype foratomic_addx2.This test has two critical issues:
Missing decorator:
test_atomic_addx2_half(line 249) has@tilelang.testing.requires_cuda, but this test doesn't. Both use the same CUDA-specific operation and should both require it.Invalid dtype:
atomic_addx2is designed for paired half-precision types (float16/bfloat16) as shown in the function docstring examples. Using float32 will likely fail or produce incorrect results. Useatomic_addx4for float32 (which already has a dedicated test at line 360), or change this test to use float16 or bfloat16.Immediate fix for missing decorator
+@tilelang.testing.requires_cuda def test_atomic_addx2_float(): run_atomic_addx2(32, 64, 8, 16, dtype=T.float32)Then either:
- Change
dtype=T.float32todtype=T.float16(orT.bfloat16)- Or remove this test if
test_atomic_addx4at line 360 already covers float32 atomic operationstesting/python/language/test_tilelang_language_mask_op.py (3)
65-66: Same verification needed as lines 31-32.This follows the same pattern as the first test - removed
target="cuda"from compile while keeping hardcodeddevice="cuda"in torch tensor creation.
100-101: Same verification needed as lines 31-32.This follows the same pattern as the other tests - removed
target="cuda"from compile while keeping hardcodeddevice="cuda"in torch tensor creation.
134-135: Same verification needed as lines 31-32.This follows the same pattern as the other tests - removed
target="cuda"from compile while keeping hardcodeddevice="cuda"in torch tensor creation.
🧹 Nitpick comments (8)
src/tl_templates/hip/atomic.h (1)
7-11: Redundantreinterpret_castin pointer overloads.The
reinterpret_cast<T1*>(address)on line 10 is a no-op sinceaddressis already of typeT1*. The same applies to other pointer overloads (lines 35, 52). This is harmless but adds unnecessary noise.🔎 Proposed simplification
template <typename T1, typename T2> __forceinline__ __device__ void AtomicAdd(T1 *address, T2 val, int memory_order = 0) { - atomicAdd(reinterpret_cast<T1 *>(address), static_cast<T1>(val)); + atomicAdd(address, static_cast<T1>(val)); }Similar changes for
AtomicMax(line 35) andAtomicMin(line 52).src/tl_templates/hip/common.h (1)
112-130: Sequential loops may be slow for large arrays.The
AnyandAllfunctions use sequential iteration on the device. This is fine for small fixed-size arrays but could be a performance bottleneck if called with largesizevalues.Consider documenting the expected use case (e.g., small compile-time-known sizes) or adding a warp-parallel variant if performance becomes an issue.
tilelang/jit/adapter/wrapper.py (1)
350-351: Consider adding error handling for edge cases.The implementation correctly extracts CUDA kernel declarations by splitting at the first semicolon. However, if the semicolon is missing or appears in an unexpected context (e.g., inside a comment or string), the method returns the entire input string, which may not be the intended behavior.
💡 Optional: Add validation or use more robust parsing
def get_declaration(self, declare_kernel_code: str) -> str: - return declare_kernel_code.split(";")[0] + parts = declare_kernel_code.split(";", 1) + if len(parts) < 2: + logger.warning(f"Expected semicolon in kernel declaration, got: {declare_kernel_code[:100]}") + return parts[0]testing/python/language/test_tilelang_language_atomic_add.py (1)
214-215: Align tensor initialization with the standard pattern.The two-step initialization (create as float32 → convert to dtype) differs from the direct-creation pattern used consistently throughout this file. Compare with lines 30-31, 63-64, 98-99, etc.:
torch.randn(..., dtype=getattr(torch, dtype)).cuda().🔎 Refactor to match standard pattern
- A = torch.randn(M, N, dtype=torch.float32).cuda().to(getattr(torch, dtype)) - B = torch.zeros(M, N, dtype=torch.float32).cuda().to(getattr(torch, dtype)) + A = torch.randn(M, N, dtype=getattr(torch, dtype)).cuda() + B = torch.zeros(M, N, dtype=getattr(torch, dtype)).cuda()If the two-step conversion is intentional (e.g., to match specific numerical behavior), add a brief comment explaining why.
testing/python/kernel/test_tilelang_kernel_gemm.py (1)
255-255: Consistent with f16 accumulation gating pattern.These padding tests follow the same gating pattern as other f16f16f16 tests, while
test_pad_f16f16f32_nn(line 291) remains ungated. This confirms the ROCm issue is related to f16 accumulation dtype rather than padding logic.Consider consolidating the ROCm compatibility constraints into documentation or a decision matrix showing:
- Which dtype combinations work on CUDA vs ROCm
- Whether the limitation is temporary (bugs to fix) or architectural (unsupported operations)
Also applies to: 273-273
testing/python/language/test_tilelang_language_mask_op.py (1)
37-141: Consider adding CUDA availability checks for consistency with related tests.While most language tests don't include GPU checks, test_tilelang_language_parallel.py (in the same directory) includes a pattern for gracefully skipping when CUDA is unavailable:
if not torch.cuda.is_available(): pytest.skip("CUDA not available")Since these tests use
device="cuda"and will fail on non-GPU systems, consider adding similar checks to the test functions for consistency.testing/python/language/test_tilelang_language_annotate_safe_value.py (1)
32-32: Consider parameterizing the device for multi-backend testing.While ROCm's PyTorch typically supports
device="cuda"for compatibility, explicitly using a hardcoded CUDA device string is conceptually inconsistent with removing the explicittarget="cuda"from the compile call above. For clearer multi-backend support, consider parameterizing the device based on available hardware.testing/python/language/test_tilelang_language_alloc.py (1)
154-156: Consider more robust initializer assertions.The
kernel_only=Trueparameter and exact count assertions improve test precision. However, the current assertionscode.count("= 1;")andcode.count("= 2;")could fail if other unrelated assignments with these values appear in the generated kernel code.🔎 Suggested improvement for more robust assertions
Consider using more specific patterns that capture the variable name context:
kernel = tilelang.compile(program, out_idx=[1]) code = kernel.get_kernel_source(kernel_only=True) -assert code.count("= 1;") == 1 -assert code.count("= 2;") == 1 +import re +# Match variable initialization patterns more precisely +assert len(re.findall(r'\btmp0\s*=\s*1;', code)) == 1 +assert len(re.findall(r'\btmp1\s*=\s*2;', code)) == 1This ensures we're specifically checking the
tmp0andtmp1initializations rather than any occurrence of= 1;or= 2;.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (54)
- .github/workflows/ci.yml
- src/op/gemm.cc
- src/op/logical.cc
- src/target/codegen_hip.cc
- src/tl_templates/cuda/reduce.h
- src/tl_templates/hip/atomic.h
- src/tl_templates/hip/common.h
- src/tl_templates/hip/debug.h
- src/tl_templates/hip/hip_fp8.h
- src/tl_templates/hip/reduce.h
- testing/python/autotune/test_tilelang_autotune.py
- testing/python/cache/test_tilelang_kernel_cache.py
- testing/python/carver/test_tilelang_carver_cuda_driver_properties.py
- testing/python/carver/test_tilelang_carver_recommend_hints.py
- testing/python/components/test_storage_rewrite_detect_inplace.py
- testing/python/debug/test_device_assert.py
- testing/python/debug/test_tilelang_debug_print.py
- testing/python/issue/test_tilelang_issue_1001.py
- testing/python/issue/test_tilelang_issue_1008.py
- testing/python/issue/test_tilelang_issue_830.py
- testing/python/issue/test_tilelang_issue_96.py
- testing/python/jit/test_tilelang_jit_callback.py
- testing/python/jit/test_tilelang_jit_cutedsl.py
- testing/python/jit/test_tilelang_jit_gemm.py
- testing/python/jit/test_tilelang_jit_gemm_cython.py
- testing/python/jit/test_tilelang_jit_nvrtc.py
- testing/python/jit/test_tilelang_jit_parcompile.py
- testing/python/jit/test_tilelang_jit_tvm_ffi.py
- testing/python/kernel/test_tilelang_kernel_gemm.py
- testing/python/kernel/test_tilelang_kernel_gemm_simt.py
- testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py
- testing/python/language/test_tilelang_language_alias.py
- testing/python/language/test_tilelang_language_alloc.py
- testing/python/language/test_tilelang_language_annot.py
- testing/python/language/test_tilelang_language_annotate_safe_value.py
- testing/python/language/test_tilelang_language_atomic_add.py
- testing/python/language/test_tilelang_language_clear.py
- testing/python/language/test_tilelang_language_composable_index.py
- testing/python/language/test_tilelang_language_copy.py
- testing/python/language/test_tilelang_language_frontend_v2.py
- testing/python/language/test_tilelang_language_infinity.py
- testing/python/language/test_tilelang_language_let.py
- testing/python/language/test_tilelang_language_mask_op.py
- testing/python/language/test_tilelang_language_ptr.py
- testing/python/language/test_tilelang_language_unroll.py
- testing/python/language/test_tilelang_language_var_init.py
- testing/python/language/test_tilelang_language_vectorize.py
- testing/python/language/test_tilelang_language_vectorized_cast.py
- testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py
- testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py
- tilelang/engine/lower.py
- tilelang/intrinsics/mfma_layout.py
- tilelang/intrinsics/mfma_macro_generator.py
- tilelang/jit/adapter/wrapper.py
💤 Files with no reviewable changes (3)
- testing/python/issue/test_tilelang_issue_830.py
- testing/python/language/test_tilelang_language_copy.py
- testing/python/language/test_tilelang_language_composable_index.py
🚧 Files skipped from review as they are similar to previous changes (25)
- testing/python/issue/test_tilelang_issue_96.py
- testing/python/components/test_storage_rewrite_detect_inplace.py
- testing/python/language/test_tilelang_language_clear.py
- testing/python/carver/test_tilelang_carver_recommend_hints.py
- tilelang/intrinsics/mfma_macro_generator.py
- src/op/logical.cc
- testing/python/debug/test_tilelang_debug_print.py
- testing/python/issue/test_tilelang_issue_1001.py
- testing/python/issue/test_tilelang_issue_1008.py
- testing/python/language/test_tilelang_language_frontend_v2.py
- testing/python/jit/test_tilelang_jit_parcompile.py
- testing/python/language/test_tilelang_language_let.py
- testing/python/language/test_tilelang_language_var_init.py
- testing/python/language/test_tilelang_language_annot.py
- testing/python/language/test_tilelang_language_infinity.py
- testing/python/kernel/test_tilelang_kernel_gemm_simt.py
- src/target/codegen_hip.cc
- src/op/gemm.cc
- testing/python/language/test_tilelang_language_ptr.py
- testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py
- testing/python/debug/test_device_assert.py
- tilelang/engine/lower.py
- src/tl_templates/hip/hip_fp8.h
- testing/python/language/test_tilelang_language_vectorized_cast.py
- testing/python/carver/test_tilelang_carver_cuda_driver_properties.py
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.
Applied to files:
- testing/python/language/test_tilelang_language_annotate_safe_value.py
- testing/python/language/test_tilelang_language_mask_op.py
- testing/python/autotune/test_tilelang_autotune.py
- testing/python/jit/test_tilelang_jit_cutedsl.py
- testing/python/jit/test_tilelang_jit_nvrtc.py
- testing/python/language/test_tilelang_language_alloc.py
- testing/python/language/test_tilelang_language_unroll.py
- testing/python/language/test_tilelang_language_vectorize.py
- testing/python/jit/test_tilelang_jit_callback.py
- testing/python/jit/test_tilelang_jit_tvm_ffi.py
- testing/python/language/test_tilelang_language_alias.py
- testing/python/jit/test_tilelang_jit_gemm.py
- testing/python/cache/test_tilelang_kernel_cache.py
- testing/python/kernel/test_tilelang_kernel_gemm.py
- testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.
Applied to files:
- testing/python/language/test_tilelang_language_annotate_safe_value.py
- testing/python/language/test_tilelang_language_mask_op.py
- testing/python/language/test_tilelang_language_alloc.py
- testing/python/language/test_tilelang_language_unroll.py
- testing/python/language/test_tilelang_language_atomic_add.py
- testing/python/language/test_tilelang_language_alias.py
- testing/python/kernel/test_tilelang_kernel_gemm.py
- testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py
- testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py
📚 Learning: 2025-12-26T06:45:51.789Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1483
File: tilelang/jit/adapter/cutedsl/adapter.py:93-95
Timestamp: 2025-12-26T06:45:51.789Z
Learning: For the CuTeDSL backend in tilelang/jit/adapter/cutedsl/adapter.py, the host_kernel_source and device_kernel_source have the same value.
Applied to files:
testing/python/jit/test_tilelang_jit_cutedsl.py
📚 Learning: 2025-09-15T10:51:06.985Z
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h.
Applied to files:
src/tl_templates/hip/reduce.h
📚 Learning: 2025-09-15T10:51:06.985Z
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). However, the layout template parameter should use sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h where Crosswise uses sizeof(ElementA) but the layout template uses sizeof_bits<ElementA>::value.
Applied to files:
src/tl_templates/hip/reduce.h
🧬 Code graph analysis (11)
testing/python/language/test_tilelang_language_annotate_safe_value.py (1)
tilelang/jit/kernel.py (1)
out_idx(609-610)
testing/python/jit/test_tilelang_jit_gemm_cython.py (2)
tilelang/language/v2/dtypes.py (2)
float32(300-300)float16(299-299)tilelang/language/symbolics.py (1)
dynamic(12-29)
testing/python/language/test_tilelang_language_mask_op.py (1)
tilelang/jit/__init__.py (2)
compile(47-107)compile(347-373)
src/tl_templates/hip/reduce.h (1)
src/tl_templates/hip/common.h (13)
tl(110-200)shfl_down(138-140)shfl_down(157-161)shfl_down(182-186)shfl_xor(134-136)shfl_xor(151-155)shfl_xor(176-180)shfl(146-148)shfl(169-173)shfl(194-198)shfl_up(142-144)shfl_up(163-167)shfl_up(188-192)
testing/python/jit/test_tilelang_jit_callback.py (2)
testing/python/cache/test_tilelang_kernel_cache.py (1)
callback(57-60)tilelang/engine/callback.py (2)
register_cuda_postproc_callback(44-74)register_hip_postproc_callback(77-107)
testing/python/jit/test_tilelang_jit_tvm_ffi.py (2)
tilelang/language/v2/dtypes.py (2)
float32(300-300)float16(299-299)tilelang/language/symbolics.py (1)
dynamic(12-29)
testing/python/jit/test_tilelang_jit_gemm.py (1)
tilelang/language/v2/dtypes.py (1)
float32(300-300)
src/tl_templates/hip/common.h (3)
src/tl_templates/cuda/reduce.h (3)
tl(10-83)half_t(17-19)bfloat16_t(20-22)src/tl_templates/cuda/common.h (5)
tl(248-342)tl(571-573)tl(574-576)tl(591-671)bfloat16_t(581-582)src/tl_templates/cuda/atomic.h (1)
half_t(24-26)
src/tl_templates/hip/debug.h (2)
tilelang/language/print_op.py (1)
print_var(15-25)tilelang/language/v2/dtypes.py (2)
short(235-235)long(238-238)
testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py (2)
testing/python/amd/test_tilelang_test_amd.py (1)
run_gemm_rs(195-239)maint/gemm_v2/correctness_evaluation.py (3)
run_gemm_rs(178-210)run_gemm_sr(262-295)run_gemm_rr(350-383)
src/tl_templates/cuda/reduce.h (1)
src/tl_templates/hip/reduce.h (2)
run(121-181)run(188-259)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
- GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
- GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
🔇 Additional comments (54)
testing/python/jit/test_tilelang_jit_gemm_cython.py (2)
169-169: LGTM: Improved numerical precision.Upgrading the accumulator dtype to
T.float32while keeping fp16 inputs/outputs improves numerical stability by reducing accumulation errors during matrix multiplication. This aligns with industry best practices for GEMM operations.
211-211: LGTM: Consistent precision upgrades across all test variants.The accumulator dtype has been consistently upgraded to
T.float32across all test scenarios (do_bench, multi_stream, dynamic_shape variants). This ensures uniform numerical behavior and improved accuracy throughout the test suite.Also applies to: 255-255, 304-308, 357-357
testing/python/jit/test_tilelang_jit_gemm.py (1)
106-106: LGTM: Test name and configuration properly updated.The test function has been appropriately renamed to
test_gemm_f16f16f32_nn_kernel_jitto reflect the dtype configuration (fp16 inputs, fp16 outputs, fp32 accumulation), and the accumulator dtype parameter has been consistently updated toT.float32. This change aligns with the precision upgrades across the test suite.Also applies to: 115-115
testing/python/language/test_tilelang_language_alias.py (1)
48-48: This follows the standard codebase pattern; no additional verification needed for automatic target selection.The compile call without explicit
targetparameter is consistent with the existing codebase convention. Most tests in the repository rely on automatic target detection, so this change aligns with standard practice rather than introducing a new mechanism. The test will work with the default target available in the environment.testing/python/language/test_tilelang_language_unroll.py (3)
20-21: Good addition of TODO comment and CUDA-gating decorator.The TODO comment clearly documents the HIP limitation for unroll factor, and the
@tilelang.testing.requires_cudadecorator appropriately gates this test to CUDA-only environments. This is consistent with the PR's goal of enabling ROCm CI while properly handling platform-specific limitations.
32-33: Consistent removal of explicit target parameter.The removal of the explicit
target="cuda"parameter is consistent withtest_unroll_with_step(line 16) and aligns with the broader PR pattern of relying on environment-based default targets. Since this test is gated with@tilelang.testing.requires_cuda, the default target will correctly resolve to CUDA when the test executes.
16-17: No changes needed. The test design is correct as is.The basic unroll with step (line 16) is intentionally not gated to CUDA because it is supported on both CUDA and HIP. HIP codegen does generate
#pragma unrollfor unrolled loops, making the assertion valid on both platforms. The unroll_factor variant (line 22) is correctly gated to CUDA-only because HIP does not support the#pragma unroll <N>form—as documented in the TODO comment on line 20..github/workflows/ci.yml (1)
387-388: LGTM: Temporary test exclusion for ROCm.The changes appropriately exclude the runtime and transform test directories from ROCm CI runs while they're being repaired, and broaden test discovery from
./python/amdto./pythonwith explicit ignores. This is a reasonable temporary solution.testing/python/language/test_tilelang_language_vectorize.py (1)
127-127: LGTM: Appropriate CUDA gating for vectorization test.The decorator correctly gates this vectorization test to CUDA-enabled environments, aligning with the PR's objective to partition tests by backend capability.
testing/python/autotune/test_tilelang_autotune.py (3)
262-262: LGTM: CUDA gating for autotuner configuration test.The decorator appropriately restricts this test to CUDA-enabled environments, consistent with the broader PR pattern.
268-268: LGTM: CUDA gating for autotuner matmul test.The decorator correctly restricts this test to CUDA-enabled environments.
250-254: Removal of explicittarget="auto"is safe and preserves behavior.The
targetparameter defaults toNoneinset_compile_args, which triggers fallback to theTILELANG_TARGETenvironment variable that itself defaults to"auto". Removing the explicittarget="auto"parameter does not change behavior—it relies on the same default mechanism that produces identical results. The test now uses the environment variable defaults instead of explicit parameter passing, both of which resolve to the same target selection strategy.Note: The example file
examples/gemm/example_gemm_autotune.pystill explicitly specifiestarget="auto"at line 153, maintaining consistency with the documented pattern, though both approaches are functionally equivalent.testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py (4)
156-156: LGTM: CUDA gating for sparse GEMM test.The decorator appropriately restricts the
test_gemm_sssparse GEMM test to CUDA-enabled environments.
307-307: LGTM: CUDA gating for register-shared sparse GEMM test.The decorator appropriately restricts the
test_gemm_rstest to CUDA-enabled environments.
458-458: LGTM: CUDA gating for shared-register sparse GEMM test.The decorator appropriately restricts the
test_gemm_srtest to CUDA-enabled environments.
613-613: LGTM: CUDA gating for register-register sparse GEMM test.The decorator appropriately restricts the
test_gemm_rrtest to CUDA-enabled environments.testing/python/jit/test_tilelang_jit_cutedsl.py (4)
158-158: LGTM: CUDA gating for CuTeDSL JIT kernel test.The decorator appropriately restricts this CuTeDSL compilation test to CUDA-enabled environments, which is necessary since CuTeDSL is a CUDA-specific backend.
210-210: LGTM: CUDA gating for CuTeDSL benchmarking test.The decorator correctly restricts the benchmarking test to CUDA-enabled environments.
253-253: LGTM: CUDA gating for CuTeDSL multi-stream test.The decorator appropriately restricts the multi-stream test to CUDA-enabled environments.
303-303: LGTM: CUDA gating for CuTeDSL dynamic shape test.The decorator correctly restricts the dynamic shape test to CUDA-enabled environments.
testing/python/cache/test_tilelang_kernel_cache.py (1)
121-121: LGTM! CUDA decorator is appropriate here.This test executes GPU kernels (lines 190-191) and creates CUDA tensors (lines 187-188), so the
@tilelang.testing.requires_cudadecorator is correctly applied.testing/python/jit/test_tilelang_jit_nvrtc.py (1)
159-159: LGTM! CUDA decorators are appropriate for NVRTC tests.All six tests use the NVRTC (NVIDIA Runtime Compiler) backend which is CUDA-specific. The tests also create CUDA tensors and use CUDA-specific APIs, so the
@tilelang.testing.requires_cudadecorators are correctly applied.Also applies to: 211-211, 254-254, 304-304, 376-376, 390-390
testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py (4)
114-118: LGTM! Improved accumulator precision for float16 inputs.Changing
dtypeAccumfromT.float16toT.float32improves numerical stability and reduces accumulation errors in GEMM operations, which is a best practice for mixed-precision computation.
271-275: LGTM! Improved accumulator precision for float16 inputs.Changing
dtypeAccumfromT.float16toT.float32improves numerical stability, consistent with the changes in other GEMM variants.
427-431: LGTM! Improved accumulator precision for float16 inputs.Changing
dtypeAccumfromT.float16toT.float32improves numerical stability, consistent with other GEMM variants.
590-593: LGTM! Improved accumulator precision for float16 inputs.Changing
dtypeAccumfromT.float16toT.float32improves numerical stability, consistent with other GEMM variants.src/tl_templates/hip/debug.h (6)
6-21: LGTM: Clean fallback implementation.The primary template provides a safe fallback by printing the address for unknown types, maintaining a consistent interface for both
print_varandprint_buffer.
60-74: LGTM: Clear bool representation.The bool specialization correctly prints
"true"or"false"strings instead of numeric values, making debug output more readable.
76-90: LGTM: Safe pointer printing.The pointer specialization correctly uses
%pformat and casts tovoid*, following best practices for printing pointer addresses.
92-101: LGTM: Clean delegation to trait system.The public interface functions correctly delegate to the trait system, maintaining a clean separation between the API and implementation details.
53-54: Types half_t and bfloat16_t are properly defined and support float conversion.Both types are correctly aliased in
src/tl_templates/hip/common.h:half_tmaps tofloat16_tandbfloat16_tmaps tohip_bfloat16. These are standard HIP types with implicit conversion to float, making the print trait definitions valid.
4-4: The FP8 type definitions inhip_fp8.hare properly defined. Bothfp8_e4_tandfp8_e5_tare correctly aliased to HIP's native FP8 types (__hip_fp8_e4m3_fnuzand__hip_fp8_e5m2_fnuzrespectively from<hip/amd_detail/amd_hip_fp8.h>), which natively support conversion tofloat. The macro specializations at lines 56-57 indebug.hwill work correctly with these types.src/tl_templates/cuda/reduce.h (1)
177-186: LGTM: Correct signature and indexing fixes.The changes properly:
- Update
CumSum2D::runto returnvoidsince results are written directly todst- Use the
SEGtemplate parameter forlaneandrowcomputation instead of hardcoded32, ensuring consistency with the segment sizeNote: The past review comment about
src/op/reduce.ccline 570 usingdst->dtypeinstead ofDataType::Handle()for the call to this void-returning function remains relevant and should be addressed separately.src/tl_templates/hip/atomic.h (1)
64-104: LGTM: Vectorized atomics match expected per-element semantics.The
AtomicAddx2/x4andAtomicAddx2Ret/x4Retimplementations correctly perform per-element atomic operations. The*Retvariants return the old values from each individual atomic, which is the expected behavior (not a single atomic operation on the entire vector).src/tl_templates/hip/common.h (1)
132-148: LGTM: HIP shuffle wrappers with noted sync limitation.The TODO on line 132 correctly notes that ROCm 7.1.1+ will provide
shfl_syncvariants. The current implementation using non-sync HIP intrinsics is appropriate for current ROCm versions.The generic templates delegate to raw HIP intrinsics, with specializations (lines 151-198) handling
half_tandbfloat16_tvia float conversion, matching the pattern in the CUDA counterpart.src/tl_templates/hip/reduce.h (5)
76-77: LGTM: Correct use of type-safe shuffle wrapper.Using
tl::shfl_downinstead of the raw__shfl_downintrinsic ensures proper handling ofhalf_tandbfloat16_ttypes through the float-conversion specializations defined incommon.h.
107-108: LGTM: Consistent use oftl::shfl_xorwrapper.
117-182: LGTM: CumSum1D implementation correctly uses HIP warp size.The implementation properly:
- Uses
SEG=64default matching AMD's warp/wavefront size- Employs
tl::shfl_*wrappers throughout for type safety- Handles both forward and reverse prefix-sum modes
- Early-exits threads beyond SEG (line 129-130) since only one warp performs the scan
184-260: LGTM: CumSum2D implementation with axis-aware indexing.The implementation correctly:
- Supports cumulative sum along either axis via the
Axistemplate parameter- Uses proper real_row/real_col mapping based on axis selection
- Employs
tl::shfl_*wrappers for type-safe shuffling- Handles boundary conditions and early exits appropriately
273-291: LGTM: Convenience wrappers for common reduction operations.These wrappers correctly delegate to
warp_reducewith the appropriate operator. They will function correctly once the underlyingwarp_reduceissue (usingtl::shfl_xor) is addressed.testing/python/jit/test_tilelang_jit_callback.py (2)
4-4: LGTM! Import addition mirrors the CUDA callback pattern.The import of
register_hip_postproc_callbackis necessary for HIP/ROCm support and follows the existing pattern for CUDA callbacks.
93-96: HIP callback correctly mirrors the CUDA callback pattern.The HIP post-processing callback is implemented consistently with the CUDA callback at lines 88-91. Both callbacks are registered in the same scope and will coexist without conflict (the runtime determines which is invoked based on the target).
Note:
test_cuda_postproc_callbackis skipped (line 107), andtest_gemm_jit_kerneldoesn't verify callback behavior—it only tests computational correctness. Consider enabling or adding tests that verify the HIP callback is invoked on ROCm hardware.tilelang/intrinsics/mfma_layout.py (3)
3-3: LGTM!The import addition is necessary for the
const(0)usage in the layout functions.
18-20: Same dtype verification applies here.This change mirrors line 9. Ensure the default int32 dtype for
const(0)is appropriate for ROCm compilation.
7-9: This change is intentional for HIP/ROCm compatibility and requires no modifications.The use of
const(0)was introduced specifically to fix HIP build issues. The default int32 dtype is appropriate for layout index calculations, where boththread_id(computed from arithmetic) and the constant offset are int32-compatible. The consistent pattern across both functions confirms this is the correct approach for ROCm targets.tilelang/jit/adapter/wrapper.py (1)
653-656: LGTM! HIP-specific declaration extraction is well-designed.The override correctly handles HIP kernel syntax by splitting at the opening brace instead of a semicolon. The comment clearly explains why this differs from CUDA, and the approach aligns with HIP's inline kernel definition style.
💡 Optional: Add defensive check
For added robustness, you could verify the "{" exists:
def get_declaration(self, declare_kernel_code: str) -> str: # HIP code dont have function declaration, so we use '{\n' to split # __global__ void __launch_bounds__(128) kernel_kernel(float* __restrict__ A) {\n - return declare_kernel_code.split("{")[0] + parts = declare_kernel_code.split("{", 1) + if len(parts) < 2: + logger.warning(f"Expected opening brace in HIP kernel, got: {declare_kernel_code[:100]}") + return parts[0]testing/python/language/test_tilelang_language_atomic_add.py (4)
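A side-by-side sketch of the two extraction strategies discussed here; this is illustrative only, with made-up kernel strings rather than the adapter's real inputs:

```python
def get_declaration_cuda(code: str) -> str:
    # CUDA sources carry a forward declaration terminated by ';'.
    return code.split(";", 1)[0]

def get_declaration_hip(code: str) -> str:
    # HIP sources start the definition immediately, so split at the opening brace.
    return code.split("{", 1)[0]

cuda_src = 'extern "C" __global__ void main_kernel(float* A);\n...'
hip_src = '__global__ void __launch_bounds__(128) main_kernel(float* A) {\n  ...\n}'
assert get_declaration_cuda(cuda_src).endswith("main_kernel(float* A)")
assert get_declaration_hip(hip_src).rstrip().endswith("main_kernel(float* A)")
```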
210-211: LGTM! Consistent dtype parameterization.The dtype parameter propagation follows the established pattern used in other test runner functions throughout the file.
238-238: LGTM! Appropriate CUDA gating added.The
@requires_cudadecorators correctly gate CUDA-specific atomic operations, preventing test failures on non-CUDA platforms.Also applies to: 243-243
248-250: LGTM! Well-formed half-precision test.The test correctly exercises
atomic_addx2with float16, includes the required@requires_cudadecorator, and follows the established test pattern.
353-353: LGTM! Proper CUDA gating for memory-order tests.The decorator correctly gates CUDA-specific atomic memory ordering operations.
testing/python/language/test_tilelang_language_annotate_safe_value.py (1)
31-31: Default target selection is correctly implemented. The removal of explicittarget="cuda"is appropriate—thetilelang.compile()function uses environment variableTILELANG_TARGET(defaults to"auto"), which triggersdetermine_target()to intelligently detect available backends (CUDA, then HIP, then Metal). PyTorch'sdevice="cuda"is compatible with ROCm systems via the CUDA compatibility layer, and@tilelang.testing.requires_cudafrom TVM's testing utilities works correctly on both CUDA and ROCm. The code properly supports multi-backend compilation as intended.testing/python/jit/test_tilelang_jit_tvm_ffi.py (1)
168-168: LGTM: Accumulation dtype updated to float32 for numerical stability.The change from
T.float16toT.float32fordtypeAccumacross all test invocations aligns with the reference implementation which usestorch.float(float32) for matmul computation. Accumulating in higher precision is a standard practice to avoid numerical errors.Also applies to: 211-211, 253-253, 302-307
testing/python/language/test_tilelang_language_alloc.py (2)
118-121: LGTM - Appropriate CUDA gating for unsupported ROCm tests.The
@tilelang.testing.requires_cudadecorator correctly restricts these tests to CUDA until ROCm support is implemented. The TODO comment clearly documents the temporary limitation.
36-36: Both"tmp ="and"tmp[0] ="are valid code generation patterns for single-element buffers. Sincealloc_varcreates a buffer with shape[1], different backend optimization passes may generate either the scalar form or the explicit array indexing form. The OR assertion correctly accommodates this legitimate variation and is appropriate for supporting multiple backends with different codegen strategies.
```python
            T.float16,
            T.float16,
            T.float16,
            T.float32,
```
🧩 Analysis chain
🏁 Script executed:
# First, let's examine the test file to understand the context
cat -n testing/python/jit/test_tilelang_jit_callback.py | head -150Repository: tile-ai/tilelang
Length of output: 4675
🏁 Script executed:
# Also check around line 232
cat -n testing/python/jit/test_tilelang_jit_callback.py | sed -n '220,240p'Repository: tile-ai/tilelang
Length of output: 571
🏁 Script executed:
# Search for float16 accumulation patterns in the codebase
rg -n 'accum_dtype.*float16' testing/ --type pyRepository: tile-ai/tilelang
Length of output: 549
🏁 Script executed:
# Search for float16 in gemm or matmul related tests
rg -n -B 5 -A 5 'float16.*float16.*float16' testing/ --type py | head -100Repository: tile-ai/tilelang
Length of output: 8666
🏁 Script executed:
# Look for ROCm-related comments or issues mentioning dtype precision
rg -n -i 'rocm.*precision|rocm.*dtype|rocm.*float16|rocm.*float32|mfma.*precision|fp8.*accumul' . --type py --type md --type txt | head -50Repository: tile-ai/tilelang
Length of output: 883
The accumulation dtype change to float32 is inconsistent with other GEMM tests in the codebase.
This change is specific to these two test functions, while other GEMM tests (e.g., test_tilelang_kernel_gemm_mma_intrinsic.py, test_tilelang_jit_nvrtc.py, test_tilelang_jit_cutedsl.py) continue to use float16 for accumulation. The active test test_gemm_jit_kernel uses loose tolerances (atol=1e-2, rtol=1e-2, max_mismatched_ratio=0.05), suggesting this change addresses a numerical precision issue specific to these tests.
Clarify:
- Is this change specific to callback/JIT compilation mechanisms, or should similar updates apply to other GEMM test variants?
- If callback/JIT-specific, document why float32 accumulation is required for these paths.
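To make the precision argument concrete, here is a small self-contained sketch (sizes and seed arbitrary, not taken from the test suite) of how fp16 accumulation drifts away from an fp32-accumulated reference:

```python
import torch

torch.manual_seed(0)
M = N = K = 256
a = torch.randn(M, K, dtype=torch.float16)
b = torch.randn(K, N, dtype=torch.float16)

# fp32 accumulation: cast up, multiply-accumulate, cast the result back down.
ref_fp32_accum = (a.float() @ b.float()).half()

# Simulated fp16 accumulation: keep the running sum in half precision.
acc = torch.zeros(M, N, dtype=torch.float16)
for k in range(K):
    acc += a[:, k:k + 1] * b[k:k + 1, :]

max_abs_err = (acc.float() - ref_fp32_accum.float()).abs().max().item()
print(f"max |fp16-accum - fp32-accum| = {max_abs_err:.4f}")
```

The gap this prints is the kind of error the loose atol/rtol settings absorb when the kernel also accumulates in fp16.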
🤖 Prompt for AI Agents
In testing/python/jit/test_tilelang_jit_callback.py around line 117, the
accumulation dtype was changed to T.float32 which is inconsistent with other
GEMM tests that use float16; either revert the accumulation dtype back to
float16 to match the rest of the GEMM test suite, or if float32 is required for
callback/JIT paths, add a concise comment above the test (or a docstring)
explaining that this path needs higher-precision accumulation, why (e.g.,
numerical instability in callback/JIT compilation), and adjust the test metadata
(tolerance or a targeted marker) so the deviation is explicit; after making the
chosen change, run the related GEMM tests to confirm behavior is consistent.
```python
    )


@tilelang.testing.requires_cuda
```
🧩 Analysis chain
🏁 Script executed:
# First, let's look at line 387 and surrounding context in the test file
sed -n '380,400p' testing/python/jit/test_tilelang_jit_tvm_ffi.pyRepository: tile-ai/tilelang
Length of output: 666
🏁 Script executed:
# Let's also check the entire structure of test functions in the file
rg -n '^\s*def test_' testing/python/jit/test_tilelang_jit_tvm_ffi.py | head -20Repository: tile-ai/tilelang
Length of output: 323
🏁 Script executed:
# Check what decorators are applied to test functions
rg -n -B 2 'def test_' testing/python/jit/test_tilelang_jit_tvm_ffi.py | grep -E '(@|def test_)' | head -30Repository: tile-ai/tilelang
Length of output: 359
🏁 Script executed:
# Check test_gemm_jit_kernel (starting at line 159)
sed -n '159,209p' testing/python/jit/test_tilelang_jit_tvm_ffi.pyRepository: tile-ai/tilelang
Length of output: 1092
🏁 Script executed:
# Check test_tvm_ffi_kernel_do_bench (starting at line 210)
sed -n '210,251p' testing/python/jit/test_tilelang_jit_tvm_ffi.pyRepository: tile-ai/tilelang
Length of output: 1223
🏁 Script executed:
# Check test_tvm_ffi_kernel_multi_stream (starting at line 252)
sed -n '252,300p' testing/python/jit/test_tilelang_jit_tvm_ffi.pyRepository: tile-ai/tilelang
Length of output: 1438
🏁 Script executed:
# Check test_tvm_ffi_dynamic_shape (starting at line 301)
sed -n '301,373p' testing/python/jit/test_tilelang_jit_tvm_ffi.pyRepository: tile-ai/tilelang
Length of output: 3152
🏁 Script executed:
# Check test_tvm_ffi_im2col_tma_desc (starting at line 374)
sed -n '374,387p' testing/python/jit/test_tilelang_jit_tvm_ffi.pyRepository: tile-ai/tilelang
Length of output: 523
🏁 Script executed:
# Check test_tvm_ffi_pdl (starting at line 455)
sed -n '455,500p' testing/python/jit/test_tilelang_jit_tvm_ffi.pyRepository: tile-ai/tilelang
Length of output: 1417
🏁 Script executed:
# Find run_gemm_jit_kernel function
rg -n "def run_gemm_jit_kernel" testing/python/jit/test_tilelang_jit_tvm_ffi.pyRepository: tile-ai/tilelang
Length of output: 88
🏁 Script executed:
# Check the beginning of the file for imports and setup
head -100 testing/python/jit/test_tilelang_jit_tvm_ffi.pyRepository: tile-ai/tilelang
Length of output: 3240
🏁 Script executed:
# Look for pytest.skip or skip mechanisms at test start
rg -n "pytest.skip|pytest.mark.skipif" testing/python/jit/test_tilelang_jit_tvm_ffi.pyRepository: tile-ai/tilelang
Length of output: 205
🏁 Script executed:
# Check the run_gemm_jit_kernel function
sed -n '102,158p' testing/python/jit/test_tilelang_jit_tvm_ffi.pyRepository: tile-ai/tilelang
Length of output: 1165
🏁 Script executed:
# Check if there are any global CUDA checks at the module level
grep -n "cuda\|requires_cuda" testing/python/jit/test_tilelang_jit_tvm_ffi.py | head -20Repository: tile-ai/tilelang
Length of output: 1252
🏁 Script executed:
# Check the file structure - look for conftest or any module-level skips
head -150 testing/python/jit/test_tilelang_jit_tvm_ffi.py | tail -50Repository: tile-ai/tilelang
Length of output: 988
🏁 Script executed:
# Check if test_tvm_ffi_kernel_do_bench uses CUDA-specific features through profiler
sed -n '210,251p' testing/python/jit/test_tilelang_jit_tvm_ffi.py | grep -E "(cuda|\.cuda\(\))"Repository: tile-ai/tilelang
Length of output: 305
🏁 Script executed:
# Verify what decorators are on test_gemm_jit_kernel and test_tvm_ffi_kernel_do_bench
sed -n '155,165p' testing/python/jit/test_tilelang_jit_tvm_ffi.pyRepository: tile-ai/tilelang
Length of output: 284
🏁 Script executed:
# Check if there's a conftest.py that might skip these tests globally
find testing/python/jit -name "conftest.py" -exec cat {} \;Repository: tile-ai/tilelang
Length of output: 42
🏁 Script executed:
# Check what profiler is and whether do_bench requires CUDA
rg -n "get_profiler\|do_bench" testing/python/jit/test_tilelang_jit_tvm_ffi.py -B 2 -A 2Repository: tile-ai/tilelang
Length of output: 42
🏁 Script executed:
# Check if tvm_ffi backend execution happens in test_tvm_ffi_kernel_do_bench
sed -n '225,251p' testing/python/jit/test_tilelang_jit_tvm_ffi.pyRepository: tile-ai/tilelang
Length of output: 789
🏁 Script executed:
# Check the full test_tvm_ffi_kernel_do_bench to see if profiler execution happens
sed -n '210,251p' testing/python/jit/test_tilelang_jit_tvm_ffi.pyRepository: tile-ai/tilelang
Length of output: 1223
🏁 Script executed:
# List all test functions and their decorators/gates
rg -B 3 '^\s*def test_' testing/python/jit/test_tilelang_jit_tvm_ffi.py | grep -E '(@|def test_|pytest.skip)' | head -40Repository: tile-ai/tilelang
Length of output: 327
Add @tilelang.testing.requires_cuda decorator to ungated tests using CUDA APIs.
The @tilelang.testing.requires_cuda decorator is correctly applied to test_tvm_ffi_l2_persistent_map. However, several other tests in this file use CUDA-specific APIs without proper gating:
test_gemm_jit_kernel: callsrun_gemm_jit_kernelwhich allocates tensors with.cuda()test_tvm_ffi_kernel_do_bench: callsrun_tvm_ffi_kernel_do_benchwhich allocates tensors with.cuda()test_tvm_ffi_kernel_multi_stream: explicitly usestorch.cuda.Stream()and.cuda()test_tvm_ffi_dynamic_shape: allocates tensors with.cuda()
These tests will fail on non-CUDA systems. Add @tilelang.testing.requires_cuda to each of these test functions to skip them appropriately. Consider also standardizing the other tests (test_tvm_ffi_im2col_tma_desc, test_tvm_ffi_pdl) which currently use custom pytest.skip() gates instead of the decorator.
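For illustration, the decorator-based gate the comment recommends over ad-hoc pytest.skip() calls; the test body here is a placeholder, not the real test:

```python
import torch
import tilelang.testing

@tilelang.testing.requires_cuda
def test_tvm_ffi_kernel_multi_stream():
    # Skipped when no CUDA device is available, so CUDA-only APIs such as
    # torch.cuda.Stream() can be used unconditionally in the body.
    stream = torch.cuda.Stream()
    with torch.cuda.stream(stream):
        x = torch.ones(32, device="cuda")
    assert x.is_cuda
```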
🤖 Prompt for AI Agents
In testing/python/jit/test_tilelang_jit_tvm_ffi.py around line 387, several
tests that use CUDA APIs are not properly gated: add the
@tilelang.testing.requires_cuda decorator above test_gemm_jit_kernel,
test_tvm_ffi_kernel_do_bench, test_tvm_ffi_kernel_multi_stream, and
test_tvm_ffi_dynamic_shape so they are skipped on non-CUDA systems; also
consider replacing the pytest.skip() guards in test_tvm_ffi_im2col_tma_desc and
test_tvm_ffi_pdl with the same @tilelang.testing.requires_cuda decorator for
consistency.
```python
    profiler.assert_allclose(ref_program, atol=1e-2, rtol=1e-2)


@tilelang.testing.requires_cuda
```
🧩 Analysis chain
🏁 Script executed:
# First, let's examine the test file to verify the decorator locations and patterns
cd testing/python/kernel && cat -n test_tilelang_kernel_gemm.py | head -120 | tail -30Repository: tile-ai/tilelang
Length of output: 176
🏁 Script executed:
# Let's see more of the file to understand the test patterns better
cd testing/python/kernel && sed -n '100,300p' test_tilelang_kernel_gemm.py | cat -nRepository: tile-ai/tilelang
Length of output: 176
🏁 Script executed:
# Search for ROCm-related decorators, comments, or configuration in the codebase
rg -n 'requires_rocm|@tilelang.testing' --type=py -C2 | head -100Repository: tile-ai/tilelang
Length of output: 7521
🏁 Script executed:
# Search for any ROCm documentation or issues
rg -n -i 'rocm|amd' --type=py --type=md -C2 | head -80Repository: tile-ai/tilelang
Length of output: 7918
🏁 Script executed:
# Check the structure of decorators in the tilelang.testing module
fd -e py | xargs rg -l 'requires_cuda|requires_rocm' | head -10Repository: tile-ai/tilelang
Length of output: 582
🏁 Script executed:
sed -n '100,280p' testing/python/kernel/test_tilelang_kernel_gemm.py | cat -nRepository: tile-ai/tilelang
Length of output: 4227
🏁 Script executed:
# Check if there's a requires_rocm decorator available and how it's used
rg -n 'def test_gemm' testing/python/kernel/test_tilelang_kernel_gemm.py | head -20Repository: tile-ai/tilelang
Length of output: 499
🏁 Script executed:
# Look at the complete context of line 237 mentioned in the scratchpad
sed -n '230,245p' testing/python/kernel/test_tilelang_kernel_gemm.py | cat -nRepository: tile-ai/tilelang
Length of output: 441
🏁 Script executed:
# Check if there are any GitHub issues or documentation about ROCm support
rg -n -i 'github.com|issue|TODO.*rocm' --type=py testing/python/kernel/test_tilelang_kernel_gemm.pyRepository: tile-ai/tilelang
Length of output: 124
🏁 Script executed:
# Check the full PR context and commit message if available
git log --oneline -20 2>/dev/null | head -10Repository: tile-ai/tilelang
Length of output: 90
🏁 Script executed:
# Look for any related issues or comments about ROCm support
rg -i 'rocm|precision.*issue' testing/python/kernel/test_tilelang_kernel_gemm.py -A2 -B2Repository: tile-ai/tilelang
Length of output: 184
🏁 Script executed:
# Check if there are more TODO comments about ROCm precision
rg -n 'TODO' testing/python/kernel/test_tilelang_kernel_gemm.pyRepository: tile-ai/tilelang
Length of output: 124
Document ROCm incompatibilities consistently across gated tests.
The @tilelang.testing.requires_cuda decorators on these tests appear intentional due to ROCm precision issues, as evidenced by the explicit TODO at line 237 ("Meets precision issue on ROCm, disable for now"). However, this rationale is documented only for test_gemm_f32f32f32_tn. The other gated tests—test_gemm_f16f16f16_nn, test_gemm_f16f16f16_tn, test_gemm_f16f16f16_nt, test_gemm_f64f64f64_nt, test_pad_aligned_f16f16f16_nn, and test_pad_f16f16f16_nn—lack comments explaining their ROCm incompatibility.
Either:
- Add similar TODO comments to the other gated tests if they share the same precision issue, or
- Clarify if these tests are gated for different reasons.
🤖 Prompt for AI Agents
In testing/python/kernel/test_tilelang_kernel_gemm.py around line 106, several
tests are gated with @tilelang.testing.requires_cuda but only
test_gemm_f32f32f32_tn has a TODO explaining ROCm precision issues; update the
other gated tests (test_gemm_f16f16f16_nn, test_gemm_f16f16f16_tn,
test_gemm_f16f16f16_nt, test_gemm_f64f64f64_nt, test_pad_aligned_f16f16f16_nn,
test_pad_f16f16f16_nn) by adding the same TODO comment above their
@requires_cuda decorators if they share the ROCm precision problem, or replace
the decorator/comment with a brief note clarifying the actual reason if it
differs.
```python
# TODO(Gong): Meets precision issue on ROCm, disable for now
@tilelang.testing.requires_cuda
def test_gemm_f32f32f32_tn():
```
Add issue tracker reference and investigate the transpose-specific precision issue.
The TODO comment indicates a known ROCm precision issue but lacks a tracking reference. Additionally, only the TN (transpose A) variant fails, while test_gemm_f32f32f32_nt (line 221) and test_gemm_f32f32f32_nn (line 156) remain ungated, suggesting the issue is specific to the transpose combination rather than the f32 dtype.
Please:
- Create a tracking issue and reference it in the TODO comment (e.g.,
TODO(Gong): #XXXX ROCm precision issue...) - Document what specific precision failure occurs (numerical error magnitude, wrong results, etc.)
- Investigate why only the TN variant fails when NN and NT variants work
Would you like me to generate a script to help identify the precision issue or open a tracking issue?
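A minimal version of such a diagnostic is sketched below; `out` and `ref` stand in for the kernel result and the torch reference, and the tolerances are placeholders:

```python
import torch

def report_mismatch(out: torch.Tensor, ref: torch.Tensor, atol=1e-3, rtol=1e-3):
    # Summarize how badly and where two results disagree, suitable for pasting
    # into a tracking issue.
    diff = (out.float() - ref.float()).abs()
    rel = diff / ref.float().abs().clamp_min(1e-6)
    bad = diff > (atol + rtol * ref.float().abs())
    print(f"max abs err = {diff.max().item():.3e}, max rel err = {rel.max().item():.3e}")
    print(f"mismatched elements: {int(bad.sum())} / {bad.numel()}")
    if bad.any():
        idx = tuple(bad.nonzero()[0].tolist())
        print(f"first mismatch at {idx}: out={out[idx].item():.6f}, ref={ref[idx].item():.6f}")
```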
🤖 Prompt for AI Agents
In testing/python/kernel/test_tilelang_kernel_gemm.py around lines 237-239, the
TODO lacks an issue tracker reference and no details on the ROCm precision
failure that affects only the TN (transpose A) variant; create a tracking issue
in the repo (e.g. #XXXX) and update the TODO to: "TODO(Gong): #XXXX ROCm
precision issue on TN variant — describe observed failure", then run focused
diagnostics capturing failing test outputs (absolute/relative error magnitudes,
mismatched indices/values, and reproduction inputs), add a short comment
summarizing the numeric error (e.g., max abs/rel error and example mismatched
element), and investigate code paths unique to A-transpose (kernel math, memory
layout, stride handling) to identify the root cause or add a targeted test-gate
referencing the new issue if temporary disabling is required.
```python
@tilelang.testing.requires_cuda
def test_tilelang_copy():
    run_tilelang_copy(M=1024, N=1024, block_M=128, block_N=128, pad_value=10)
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Find the tilelang.testing module and locate requires_cuda decorator
find . -type f -name "*.py" -path "*testing*" | head -20Repository: tile-ai/tilelang
Length of output: 1181
🏁 Script executed:
#!/bin/bash
# Search for requires_cuda decorator definition
rg -n "def requires_cuda" --type=py -A 15Repository: tile-ai/tilelang
Length of output: 2432
🏁 Script executed:
#!/bin/bash
# Search for any GPU-related decorators in testing module
rg -n "requires_.*(?:cuda|gpu)" --type=py -A 10Repository: tile-ai/tilelang
Length of output: 50373
🏁 Script executed:
#!/bin/bash
# Check if there are references to ROCm or HIP in testing utilities
rg -n "rocm|hip|gpu|cuda" --type=py -i -l path="*testing*" | head -10Repository: tile-ai/tilelang
Length of output: 114
🏁 Script executed:
#!/bin/bash
# Find the requires_cuda function definition
rg -n "^def requires_cuda" tilelang/testing/ -A 20Repository: tile-ai/tilelang
Length of output: 2671
🏁 Script executed:
#!/bin/bash
# Get the full tilelang/testing/__init__.py file to see all decorators
wc -l tilelang/testing/__init__.py
head -150 tilelang/testing/__init__.pyRepository: tile-ai/tilelang
Length of output: 3963
🏁 Script executed:
#!/bin/bash
# Check if there are any ROCm or HIP-related decorators
rg -n "rocm|hip" tilelang/testing/ --type=pyRepository: tile-ai/tilelang
Length of output: 261
🏁 Script executed:
#!/bin/bash
# Look at the actual test file that was being reviewed
cat -n testing/python/language/test_tilelang_language_annotate_safe_value.py | head -50Repository: tile-ai/tilelang
Length of output: 2095
Reconcile decorator with ROCm CI goals or make test device-agnostic.
The @tilelang.testing.requires_cuda decorator (from TVM's testing utilities) will skip this test on ROCm systems. Since the PR aims to "Open Rocm ci test", either use @tilelang.testing.requires_rocm for ROCm execution, detect the available GPU device dynamically instead of hardcoding device="cuda" on line 32, or use a more generic approach that supports both CUDA and ROCm. Note that requires_rocm is already available in tilelang.testing (exported from TVM).
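A sketch of one way to reconcile the two, picking the marker that matches the local PyTorch build; this assumes both markers can be applied directly as they are elsewhere in this suite, and the test body is a placeholder:

```python
import torch
import tilelang.testing

# requires_rocm is re-exported from TVM's testing utilities alongside requires_cuda.
requires_gpu = (
    tilelang.testing.requires_rocm if torch.version.hip else tilelang.testing.requires_cuda
)

@requires_gpu
def test_tilelang_copy_any_backend():
    # "cuda" also addresses HIP devices on ROCm builds of PyTorch.
    a = torch.randn(64, 64, device="cuda")
    assert a.is_cuda
```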
🤖 Prompt for AI Agents
testing/python/language/test_tilelang_language_annotate_safe_value.py around
lines 43-45: the test is decorated with @tilelang.testing.requires_cuda which
will skip on ROCm; make the test device-agnostic by either replacing the
decorator with a ROCm-aware check or by detecting the available GPU backend at
runtime and using that device for the test. Concretely, remove or replace
@tilelang.testing.requires_cuda with a combined check (e.g., require either CUDA
or ROCm) or use no decorator and instead determine device = "cuda" if CUDA
available else "rocm" (or skip if neither), then pass that device variable into
run_tilelang_copy (also change the hardcoded device="cuda" on line ~32
accordingly) so the test runs on both CUDA and ROCm CI.
```diff
 def atomic_addx2_program(M, N, block_M, block_N, dtype=T.float16):
     @T.prim_func
-    def atomic_addx2(A: T.Tensor((M, N), T.float16), B: T.Tensor((M, N), T.float16)):
+    def atomic_addx2(A: T.Tensor((M, N), dtype), B: T.Tensor((M, N), dtype)):
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check atomic_addx2 dtype constraints and documentation
echo "=== Searching for atomic_addx2 in atomic.py ==="
fd "atomic.py" | head -5Repository: tile-ai/tilelang
Length of output: 176
🏁 Script executed:
#!/bin/bash
# Find and inspect the atomic.py file
find . -name "atomic.py" -type f | grep -E "tilelang.*atomic\.py"Repository: tile-ai/tilelang
Length of output: 89
🏁 Script executed:
cat ./tilelang/language/atomic.pyRepository: tile-ai/tilelang
Length of output: 18000
🏁 Script executed:
cat -n ./testing/python/language/test_tilelang_language_atomic_add.py | sed -n '190,260p'Repository: tile-ai/tilelang
Length of output: 2571
🏁 Script executed:
grep -n "atomic_addx2_float\|atomic_addx2_half\|test_atomic_addx2" ./testing/python/language/test_tilelang_language_atomic_add.pyRepository: tile-ai/tilelang
Length of output: 128
🏁 Script executed:
# Check if atomic_addx2 is used elsewhere with different dtypes
rg "atomic_addx2" --type=py -B2 -A2Repository: tile-ai/tilelang
Length of output: 5052
Address dtype constraint mismatch between docstring and test.
The atomic_addx2 docstring documents support for FP16/BF16 only (see examples), yet the test file includes test_atomic_addx2_float() which explicitly calls the function with dtype=T.float32. This creates a contradiction: either float32 should be documented as supported, or the test and dtype parameter should be constrained to half-precision types. The implementation currently has no dtype validation to enforce the documented constraints.
Consider one of the following:
- Update the docstring to document float32 support if it is intentionally supported
- Restrict the
dtypeparameter inatomic_addx2_programto only accept FP16/BF16 and move float32 tests to useatomic_addx4(which is designed for quad-width float32) - Add runtime assertions to validate dtype compatibility with the atomic operation
🤖 Prompt for AI Agents
In testing/python/language/test_tilelang_language_atomic_add.py around lines
198-200, the test and docstring disagree on supported dtypes: the docstring
states atomic_addx2 supports FP16/BF16 but the test calls it with T.float32; add
a runtime dtype check and align tests/docs by restricting atomic_addx2 to
half-precision only. Update atomic_addx2_program to assert dtype is either
T.float16 or T.bfloat16 (raise a clear error if not), update the docstring to
explicitly list only FP16/BF16, and change test_atomic_addx2_float() to either
use atomic_addx4 for float32 or remove/rename the float32 test to target the
correct quad-width primitive so implementation, tests, and docs remain
consistent.
```python
    kernel = tilelang.compile(program, out_idx=[1], pass_configs={"tl.disable_warp_specialized": True, "tl.disable_tma_lower": True})
    a = torch.randn(M, N, device="cuda", dtype=getattr(torch, dtype))
```
🧩 Analysis chain
🏁 Script executed:
# Find the tilelang compile function to understand default target behavior
find . -name "*.py" -type f | head -20Repository: tile-ai/tilelang
Length of output: 1083
🏁 Script executed:
# Search for the compile function definition
rg -n "def compile" --type=py | head -20Repository: tile-ai/tilelang
Length of output: 863
🏁 Script executed:
# Look for TILELANG_TARGET environment variable usage
rg -n "TILELANG_TARGET" --type=py -C2Repository: tile-ai/tilelang
Length of output: 4831
🏁 Script executed:
# Check if test gating decorators exist in the test file
rg -n "@.*cuda\|@.*hip\|@.*requires\|skipif\|pytest.mark" testing/python/language/test_tilelang_language_mask_op.pyRepository: tile-ai/tilelang
Length of output: 42
🏁 Script executed:
# Search for test gating decorators in other language test files for comparison
rg -n "@.*cuda\|@.*hip\|@.*requires\|skipif\|pytest.mark" testing/python/language/ --type=py | head -30Repository: tile-ai/tilelang
Length of output: 42
🏁 Script executed:
# Get the compile function signature and relevant code
sed -n '47,120p' tilelang/jit/__init__.pyRepository: tile-ai/tilelang
Length of output: 3255
🏁 Script executed:
# Check the implementation of how target="auto" is handled
rg -n "auto.*target\|target.*auto" tilelang/jit/__init__.py -C3 | head -40Repository: tile-ai/tilelang
Length of output: 42
🏁 Script executed:
# Check the actual test file to see its structure and decorators
cat -n testing/python/language/test_tilelang_language_mask_op.pyRepository: tile-ai/tilelang
Length of output: 6381
🏁 Script executed:
# Search for how target resolution works - check for "auto" handling in backends
rg -n "def.*resolve.*target|target.*==.*auto|if.*auto" tilelang/ --type=py -C2 | head -50Repository: tile-ai/tilelang
Length of output: 3204
🏁 Script executed:
# Look for test gating patterns in other Python test files
find testing/python -name "*.py" -type f | head -10Repository: tile-ai/tilelang
Length of output: 580
🏁 Script executed:
# Search for pytest.mark.skipif or CUDA-specific decorators in test files
rg -n "pytest.mark.skipif|@.*cuda|@.*rocm|@.*requires_gpu" testing/python/ --type=py | head -20Repository: tile-ai/tilelang
Length of output: 2016
🏁 Script executed:
# Check if there's a pattern for conditional test decorators
rg -n "def.*test_|class.*Test" testing/python/language/ --type=py -A2 | head -40Repository: tile-ai/tilelang
Length of output: 2966
🏁 Script executed:
# Check if there's a tilelang.testing.requires_cuda decorator and how it's defined
rg -n "requires_cuda|requires_rocm" tilelang/testing/ --type=py | head -20Repository: tile-ai/tilelang
Length of output: 1837
🏁 Script executed:
# Check other language test files to see if they have decorators
ls -la testing/python/language/test_*.py | head -10Repository: tile-ai/tilelang
Length of output: 1135
🏁 Script executed:
# Check if any language tests have been gated
rg -n "@.*requires" testing/python/language/ --type=pyRepository: tile-ai/tilelang
Length of output: 5286
🏁 Script executed:
# Look at a language test file that's similar to understand the pattern
cat -n testing/python/language/test_tilelang_language_vectorized_cast.py | head -120Repository: tile-ai/tilelang
Length of output: 4956
Add test gating decorators to CUDA-specific tests.
The test functions hardcode device="cuda" in torch tensor creation but lack @tilelang.testing.requires_cuda decorators. This will cause tests to fail on ROCm/HIP systems. Add the decorator to all four test functions (test_tilelang_copy_mask_parallel, test_tilelang_copy_mask_copy, test_tilelang_copy_mask_parallel_range, test_tilelang_copy_mask_copy_range) to match the pattern used throughout other language tests and properly gate them to CUDA-only execution.
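For illustration, the gating pattern the comment asks for, applied to one of the named tests; the body shown here is a placeholder rather than the real test:

```python
import torch
import tilelang.testing

@tilelang.testing.requires_cuda
def test_tilelang_copy_mask_parallel():
    # Skipped automatically when no CUDA device is present, so the
    # hard-coded device="cuda" allocations below remain safe.
    a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
    assert a.is_cuda
```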
🤖 Prompt for AI Agents
In testing/python/language/test_tilelang_language_mask_op.py around lines 31-32,
the tests create tensors with device="cuda" but lack CUDA gating; add the
@tilelang.testing.requires_cuda decorator above each of the four test functions:
test_tilelang_copy_mask_parallel, test_tilelang_copy_mask_copy,
test_tilelang_copy_mask_parallel_range, and test_tilelang_copy_mask_copy_range
so these tests only run on CUDA-enabled environments and avoid failures on
ROCm/HIP systems.
| "float8_e5m2fnuz": "fp8_e5_t", | ||
| "float8_e4m3fnuz": "fp8_e4_t", |
🧩 Analysis chain
🏁 Script executed:
# First, let's examine the wrapper.py file to see the full context of the type maps
head -n 700 tilelang/jit/adapter/wrapper.py | tail -n 120Repository: tile-ai/tilelang
Length of output: 4344
🏁 Script executed:
# Search for uint64 usage in the codebase
rg -n "uint64" --type=py -C 2Repository: tile-ai/tilelang
Length of output: 30117
🏁 Script executed:
# Search for float8_e5m2fnuz and float8_e4m3fnuz to verify they are ROCm-specific
rg -n "float8_e5m2fnuz|float8_e4m3fnuz" --type=py -C 1Repository: tile-ai/tilelang
Length of output: 22364
🏁 Script executed:
# Check if there are differences documented between CUDA and HIP type support
rg -n "TLCUDASourceWrapper|TLHIPSourceWrapper" --type=py -A 5 | head -n 100Repository: tile-ai/tilelang
Length of output: 4074
🏁 Script executed:
# Get the complete TLCUDASourceWrapper._TYPE_MAP definition
sed -n '168,210p' tilelang/jit/adapter/wrapper.pyRepository: tile-ai/tilelang
Length of output: 1378
Add "uint64": "uint64_t" to the CUDA type map for consistency.
The HIP type map includes "uint64": "uint64_t", but the parent TLCUDASourceWrapper._TYPE_MAP does not have this mapping. Since uint64 is a valid TileLang type and CUDA kernels can use it, add this mapping to the CUDA type map to maintain consistency across backends. The CPU backend also includes this mapping.
Diff

```diff
     "float8_e5m2": "fp8_e5_t",
     "float64": "double",
     "int64": "int64_t",
     "int32": "int",
     "uint32": "unsigned int",
+    "uint64": "uint64_t",
     "bool": "int8_t",
```
🤖 Prompt for AI Agents
In tilelang/jit/adapter/wrapper.py around lines 626-627, the CUDA type map is
missing the "uint64": "uint64_t" entry which exists in the HIP and CPU maps; add
the mapping "uint64": "uint64_t" alongside the other integer mappings (e.g.,
next to "uint32": "unsigned int") so CUDA kernels can use TileLang uint64 types
and the backends remain consistent.
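A quick consistency check between backend maps could look like the sketch below; the dictionaries are illustrative subsets, not the real `_TYPE_MAP` contents:

```python
# Illustrative subsets of the per-backend type maps.
CUDA_TYPE_MAP = {"int64": "int64_t", "uint32": "unsigned int", "uint64": "uint64_t"}
HIP_TYPE_MAP = {"int64": "int64_t", "uint32": "unsigned int", "uint64": "uint64_t"}

# Keys present on one backend but not the other point at gaps like the missing uint64 entry.
only_cuda = CUDA_TYPE_MAP.keys() - HIP_TYPE_MAP.keys()
only_hip = HIP_TYPE_MAP.keys() - CUDA_TYPE_MAP.keys()
assert not only_cuda and not only_hip, (only_cuda, only_hip)
```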
Actionable comments posted: 0
♻️ Duplicate comments (1)
testing/python/jit/test_tilelang_jit_tvm_ffi.py (1)
158-173: Several tests still lack CUDA gating decorators.While decorators were added to
test_tvm_ffi_l2_persistent_mapandtest_tvm_ffi_pdl, the following tests continue to use CUDA APIs without proper gating:
test_gemm_jit_kernel(line 158): callsrun_gemm_jit_kernelwhich uses.cuda()on lines 137-138test_tvm_ffi_kernel_do_bench(line 209): uses the TVM FFI backend which requires CUDAtest_tvm_ffi_kernel_multi_stream(line 251): explicitly usestorch.cuda.Stream()and.cuda()on lines 235-236, 242, 246test_tvm_ffi_dynamic_shape(line 300): uses.cuda()on lines 285-286, 292Add
@tilelang.testing.requires_cudadecorator to these test functions to ensure they skip gracefully on non-CUDA systems.Also applies to: 209-211, 251-253, 300-307
🧹 Nitpick comments (3)
testing/python/jit/test_tilelang_jit_cython.py (1)
8-9: Optional: Consider removing redundant decorator.The
@tilelang.testing.requires_cudadecorator on line 8 may be redundant, as@tilelang.testing.requires_cuda_compute_version(9, 0)on line 9 already includesrequires_cudamarks internally (as shown intilelang/testing/__init__.py). While this doesn't cause issues, removing the redundant decorator would simplify the code.🔎 Proposed simplification
-@tilelang.testing.requires_cuda @tilelang.testing.requires_cuda_compute_version(9, 0) def test_cython_pdl():testing/python/jit/test_tilelang_jit_nvrtc.py (2)
375-386: Refactor to use decorator pattern for consistency. The test_nvrtc_im2col_tma_desc function uses manual check_hopper() + pytest.skip() (lines 378-381) to gate on compute capability 9.0, while test_nvrtc_pdl (lines 441-442) uses the @tilelang.testing.requires_cuda_compute_version(9, 0) decorator for the same purpose. This inconsistency makes the codebase harder to maintain.

🔎 Proposed refactor for consistency

 @tilelang.testing.requires_cuda
+@tilelang.testing.requires_cuda_compute_version(9, 0)
 def test_nvrtc_im2col_tma_desc():
     """Test im2col TMA descriptor with NVRTC backend."""
-    if not check_hopper():
-        import pytest
-
-        pytest.skip("Test requires Hopper GPU (compute capability 9.0)")
-
     # Small test case for im2col TMA descriptor
     run_nvrtc_im2col_tma_desc(
441-442: Optional: Consider removing redundant decorator. The @tilelang.testing.requires_cuda decorator on line 441 may be redundant, as @tilelang.testing.requires_cuda_compute_version(9, 0) on line 442 already includes requires_cuda marks internally. While this doesn't cause issues, removing the redundant decorator would simplify the code.

🔎 Proposed simplification

-@tilelang.testing.requires_cuda
 @tilelang.testing.requires_cuda_compute_version(9, 0)
 def test_nvrtc_pdl():
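For context on why the extra decorator is redundant: a compute-version gate typically folds the plain CUDA check into its skip condition. The sketch below is a hypothetical stand-in, not the actual implementation in tilelang/testing/__init__.py.

```python
import pytest
import torch


def requires_cuda_compute_version(major, minor=0):
    """Hypothetical sketch: skip the test unless a CUDA device with at least
    the requested compute capability is available (subsumes requires_cuda)."""
    satisfied = False
    if torch.cuda.is_available():
        satisfied = torch.cuda.get_device_capability() >= (major, minor)
    reason = f"requires CUDA compute capability >= {major}.{minor}"

    def decorator(func):
        return pytest.mark.skipif(not satisfied, reason=reason)(func)

    return decorator
```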
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
testing/python/jit/test_tilelang_jit_cython.py
testing/python/jit/test_tilelang_jit_nvrtc.py
testing/python/jit/test_tilelang_jit_tvm_ffi.py
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.
Applied to files:
testing/python/jit/test_tilelang_jit_tvm_ffi.py
testing/python/jit/test_tilelang_jit_cython.py
testing/python/jit/test_tilelang_jit_nvrtc.py
📚 Learning: 2025-12-24T17:20:32.819Z
Learnt from: clouds56
Repo: tile-ai/tilelang PR: 1527
File: tilelang/env.py:0-0
Timestamp: 2025-12-24T17:20:32.819Z
Learning: The nvidia-cuda-nvcc PyPI package installs to `nvidia/cu13/bin/` (for CUDA 13), `nvidia/cu12/bin/` (for CUDA 12), and `nvidia/cu11/bin/` (for CUDA 11) in the site-packages directory, not to `nvidia/cuda_nvcc/bin/`. These paths should be used when detecting CUDA installations from PyPI packages in tilelang/env.py.
Applied to files:
testing/python/jit/test_tilelang_jit_nvrtc.py
🧬 Code graph analysis (3)
testing/python/jit/test_tilelang_jit_tvm_ffi.py (3)
tilelang/language/v2/dtypes.py (2)
float32(300-300)
float16(299-299)
tilelang/language/symbolics.py (1)
dynamic(12-29)
tilelang/testing/__init__.py (1)
requires_cuda_compute_version(40-104)
testing/python/jit/test_tilelang_jit_cython.py (1)
tilelang/testing/__init__.py (1)
requires_cuda_compute_version(40-104)
testing/python/jit/test_tilelang_jit_nvrtc.py (1)
tilelang/testing/__init__.py (1)
requires_cuda_compute_version(40-104)
🔇 Additional comments (3)
testing/python/jit/test_tilelang_jit_tvm_ffi.py (2)
167-167: LGTM: Improved numerical stability with float32 accumulation. Switching from float16 to float32 accumulation dtype across these tests is a good practice. Float32 accumulation provides better numerical stability and reduces accumulation errors in matrix multiplication while keeping float16 for the input/output tensors.
Also applies to: 210-210, 252-252, 301-301, 303-303, 306-306
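The numerical point is easy to reproduce outside the kernels. The snippet below is a small standalone illustration, not taken from the test suite: it accumulates many fp16 products once in an fp16 accumulator and once in fp32, then compares both against a float64 reference.

```python
import torch

# Rounding error in the fp16 accumulator grows with the reduction length;
# the fp32 accumulator stays close to the reference.
torch.manual_seed(0)
a = torch.randn(4096, dtype=torch.float16)
b = torch.randn(4096, dtype=torch.float16)

acc_fp16 = torch.zeros((), dtype=torch.float16)
acc_fp32 = torch.zeros((), dtype=torch.float32)
for x, y in zip(a, b):
    acc_fp16 += x * y
    acc_fp32 += (x * y).float()

ref = torch.dot(a.double(), b.double())
print(abs(acc_fp16.double() - ref).item(), abs(acc_fp32.double() - ref).item())
```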
386-386: LGTM: Appropriate CUDA gating decorators. The decorators are correctly applied:

- test_tvm_ffi_l2_persistent_map properly gates on CUDA availability since it uses .cuda() tensors.
- test_tvm_ffi_pdl properly gates on both CUDA availability and Hopper architecture (9.0), since PDL (Programmatic Dependent Launch) is a Hopper-specific feature.

Also applies to: 446-447
testing/python/jit/test_tilelang_jit_nvrtc.py (1)
158-158: LGTM! The @tilelang.testing.requires_cuda decorators correctly gate these tests to run only on CUDA-enabled environments. The pattern is consistent and appropriate for tests that require CUDA support but don't need specific compute capability versions.

Also applies to: 210-210, 253-253, 303-303, 389-389
Summary by CodeRabbit
New Features
Tests
Refactor
Chores
✏️ Tip: You can customize this high-level summary in your review settings.