
Conversation

@Gongen-Ali
Collaborator

@Gongen-Ali Gongen-Ali commented Dec 16, 2025

Summary by CodeRabbit

  • New Features

    • Broader HIP support: warp reductions, device-side atomic helpers, 1D/2D cumulative-sum utilities, unified device debug printing, and expanded FP8 datatype support.
  • Tests

    • Many GPU tests now gated to CUDA; several tests updated to use higher-precision accumulation; new FP8 test variants added; tests rely more on default backend selection.
  • Refactor

    • Consolidated HIP/CUDA utilities and moved include/template path resolution to environment-driven configuration.
  • Chores

    • CI adjusted to skip specific ROCm test paths.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Contributor

coderabbitai bot commented Dec 16, 2025

📝 Walkthrough

Walkthrough

Adds HIP/ROCm FP8/MFMA and warp-reduce support, refactors HIP atomics and debug printing into trait/util namespaces, extends HIP codegen (ptx_cp_async, warp_reduce intrinsics), updates JIT wrapper/type maps and engine include-path resolution, and applies broad test gating/accum-dtype updates.

Changes

| Cohort / File(s) | Summary |
|------------------|---------|
| **CI Workflow**<br>`/.github/workflows/ci.yml` | ROCm pytest invocation comment updated; test path changed to `./python` and `--ignore=./python/runtime --ignore=./python/transform` added. |
| **GEMM & Logical ops**<br>`src/op/gemm.cc`, `src/op/logical.cc` | Add CDNA case (kNPerWarp=16) to GEMM warp partition; register hip.FLowerIntrinsic for tl.any_of / tl.all_of. |
| **HIP Codegen & Intrinsics**<br>`src/target/codegen_hip.cc` | Recognize tl::ptx_cp_async (3-/4-arg forms), simplify predicate handling, extend FP8 dtype map, and add codegen for tl::warp_reduce_* intrinsics. |
| **HIP Templates — Atomics & Common**<br>`src/tl_templates/hip/atomic.h`, `src/tl_templates/hip/common.h`, `src/tl_templates/hip/hip_fp8.h` | New device atomic helpers with optional memory_order; move shuffle/Any/All into tl:: namespace; remove global AtomicAdd overload; add include guard to hip_fp8.h. |
| **Reduce / Shuffle Utilities**<br>`src/tl_templates/hip/reduce.h`, `src/tl_templates/cuda/reduce.h` | Introduce CumSum1D/CumSum2D and warp_reduce_* utilities; replace __shfl_* with tl::shfl_*; change CumSum2D::run return type to void and update indexing/reverse logic. |
| **HIP Debug Print Refactor**<br>`src/tl_templates/hip/debug.h` | Replace per-type debug_print specializations with PrintTraits<T> trait system and macro-generated specializations; add #include "hip_fp8.h". |
| **Source Wrappers & Type Maps**<br>`tilelang/jit/adapter/wrapper.py` | Add CUDA/HIP get_declaration extraction paths; extend HIP _TYPE_MAP with "float8_e5m2fnuz": "fp8_e5_t" and "uint64": "uint64_t". |
| **MFMA Layout / Macro Generator**<br>`tilelang/intrinsics/mfma_layout.py`, `tilelang/intrinsics/mfma_macro_generator.py` | Use const(0) instead of convert(0) in layouts; add support for float8_e5m2fnuz in MFMA k-dim/suffix logic. |
| **Engine Lowering — Include Paths**<br>`tilelang/engine/lower.py` | Replace os-based path resolution with environment-driven include paths from tilelang.env. |
| **Tests & Harness**<br>`testing/python/**` | Wide test updates: many tests gated with @tilelang.testing.requires_cuda, many target="cuda" removals (use default), accumulation dtype changes (float16→float32), added FP8 test variants and small assertion/config tweaks. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Dev as Developer / CI
  participant JIT as JIT Wrapper
  participant Lower as Engine Lower
  participant Codegen as HIP Codegen
  participant Device as GPU

  Note over Dev,JIT: submit kernel source / tests
  Dev->>JIT: provide kernel source
  JIT->>JIT: extract declaration (get_declaration)  %% new CUDA/HIP paths
  JIT->>Lower: send declaration + type map (incl. FP8)
  Lower->>Lower: resolve include paths (tilelang.env) & MFMA/FP8 mapping
  Lower->>Codegen: request HIP emission (ptx_cp_async, warp_reduce, atomics)
  Codegen->>Codegen: generate HIP kernel using tl::shfl / tl::warp_reduce / atomics
  Codegen->>JIT: return emitted kernel source
  JIT->>Device: compile & launch
  Device-->>Dev: runtime results / tests

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • LeiWang1999
  • Rachmanino

"I'm a rabbit in the code glen,
HIP and CUDA hop together again.
FP8 seeds tucked in every night,
Atomics hum, reductions take flight,
Tests gated, templates shining bright 🐇"

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 7.04% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main purpose of the changeset: enabling ROCm CI tests and fixing related bugs across the codebase. |
✨ Finishing touches
  • 📝 Generate docstrings

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 413522f and 2e3e2fd.

📒 Files selected for processing (1)
  • testing/python/language/test_tilelang_language_pdl.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.

Applied to files:

  • testing/python/language/test_tilelang_language_pdl.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
🔇 Additional comments (2)
testing/python/language/test_tilelang_language_pdl.py (2)

37-37: LGTM! Correct CUDA test gating.

The @tilelang.testing.requires_cuda decorator appropriately gates this test since it compiles with a CUDA target and verifies CUDA-specific PDL intrinsics.


47-47: LGTM! Correct CUDA test gating.

The @tilelang.testing.requires_cuda decorator appropriately gates this test since it compiles with a CUDA target and verifies CUDA-specific PDL synchronization intrinsics.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run `pre-commit run --all-files` in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@Gongen-Ali Gongen-Ali changed the title from "Rocm ci" to "[WIP] Rocm ci" on Dec 17, 2025
@Gongen-Ali Gongen-Ali force-pushed the rocm_ci branch 5 times, most recently from 00af570 to 8e77897 Compare December 29, 2025 09:42
@Gongen-Ali Gongen-Ali marked this pull request as ready for review December 29, 2025 09:42
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
testing/python/language/test_tilelang_language_atomic_add.py (1)

210-223: Fix critical dtype mismatch in tensor creation.

The function accepts a dtype parameter and passes it to the kernel (Line 211), but test tensors are always created with torch.float32 (Lines 214-215). This causes a type mismatch when the kernel expects a different dtype (e.g., T.float16).

All other atomic test functions in this file correctly use dtype=getattr(torch, dtype) when creating test tensors.

🔎 Proposed fix
-    A = torch.randn(M, N, dtype=torch.float32).cuda()
-    B = torch.zeros(M, N, dtype=torch.float32).cuda()
+    A = torch.randn(M, N, dtype=getattr(torch, dtype)).cuda()
+    B = torch.zeros(M, N, dtype=getattr(torch, dtype)).cuda()
testing/python/language/test_tilelang_language_annotate_safe_value.py (1)

31-33: Inconsistent with broader PR pattern for target selection.

This file still uses target="cuda" while other test files in this PR were updated to use target="auto" to enable automatic backend selection. For consistency with the multi-backend support objectives, consider updating this to:

-    kernel = tilelang.compile(
-        program, out_idx=[1], target="cuda", pass_configs={"tl.disable_warp_specialized": True, "tl.disable_tma_lower": True}
-    )
+    kernel = tilelang.compile(
+        program, out_idx=[1], target="auto", pass_configs={"tl.disable_warp_specialized": True, "tl.disable_tma_lower": True}
+    )

Files that were updated in this PR include:

  • test_tilelang_language_mask_op.py (lines 32, 68, 105, 141)
  • test_tilelang_language_unroll.py (lines 16, 32)
testing/python/language/test_tilelang_language_mask_op.py (1)

31-36: Add @tilelang.testing.requires_cuda decorators to test functions.

The compilation uses target="auto", which auto-detects CUDA/HIP/Metal based on availability (per tilelang/utils/target.py). However, all four test functions hardcode device="cuda" for tensor creation, creating a device mismatch if target="auto" selects HIP or Metal on a system with multiple backends available.

Add decorators to:

  • test_tilelang_copy_mask_parallel (line 39)
  • test_tilelang_copy_mask_copy (line 75)
  • test_tilelang_copy_mask_parallel_range (line 112)
  • test_tilelang_copy_mask_copy_range (line 148)

This matches the pattern used in other test files in the same directory and ensures tests skip gracefully on non-CUDA systems.

testing/python/debug/test_tilelang_debug_print.py (1)

8-17: Inconsistent target specification across helper functions.

The debug_print_buffer function (line 15) doesn't specify target="auto", while other helper functions (debug_print_buffer_conditional, debug_print_value_conditional, etc.) explicitly use target="auto". This inconsistency could cause the parameterized test_debug_print_buffer to fail on ROCm-only systems.

🔎 Suggested fix
 def debug_print_buffer(M=16, N=16, dtype=T.float16):
     @T.prim_func
     def program(Q: T.Tensor((M, N), dtype)):
         with T.Kernel(4, 4, 2, threads=128 * 2) as (bx, by, bz):
             shared_buf = T.alloc_shared([M, N], dtype)
             T.print(shared_buf)

-    jit_kernel = tilelang.compile(program)
+    jit_kernel = tilelang.compile(program, target="auto")
     profiler = jit_kernel.get_profiler()
     profiler.run_once()
🧹 Nitpick comments (7)
testing/python/jit/test_tilelang_jit_tvm_ffi.py (1)

386-386: LGTM! Properly gates CUDA-specific functionality.

The decorator correctly ensures this test only runs when CUDA is available, preventing failures on systems without CUDA support (particularly relevant given the PR's ROCM multi-backend focus).

Optional: Consider applying to other CUDA-dependent tests

Several other test functions in this file also use .cuda() calls without the decorator:

  • test_gemm_jit_kernel (lines 137-138)
  • test_tvm_ffi_kernel_multi_stream (line 246)
  • test_tvm_ffi_dynamic_shape (lines 285-286)

If these tests haven't been updated elsewhere in the PR, consider applying the same decorator for consistency.

testing/python/jit/test_tilelang_jit_callback.py (1)

117-117: Note: Accumulation dtype change has no effect (test is skipped).

The change from T.float16 to T.float32 accumulation dtype occurs in test_cuda_postproc_callback, which is skipped (line 107). This change will have no runtime effect until the test is re-enabled.

testing/python/language/test_tilelang_language_annot.py (1)

7-8: Track the cython backend bug for future resolution.

The TODO comments indicate a known bug affecting these tests with the cython backend, and the @requires_cuda decorators effectively disable them in certain configurations. This creates technical debt.

Do you want me to open a tracking issue for the cython backend build bug to ensure it's addressed in a future PR?

Also applies to: 31-32, 55-56

testing/python/kernel/test_tilelang_kernel_gemm.py (1)

237-252: Track the ROCm precision issue for f32f32f32_tn GEMM.

The TODO comment indicates a known precision issue with this specific GEMM configuration (transposed A, float32) on ROCm. This creates technical debt that should be tracked.

Do you want me to open a tracking issue for the ROCm precision problem with the f32f32f32_tn GEMM configuration to ensure it's investigated in a future PR?

tilelang/engine/lower.py (1)

118-120: Remove unused local variable assignment.

Line 119 assigns tl_template_path but never uses it. This appears to be leftover code from the refactoring to use the centralized TILELANG_TEMPLATE_PATH constant.

🔎 Proposed cleanup
 @tvm_ffi.register_global_func("tilelang_callback_hip_compile", override=True)
 def tilelang_callback_hip_compile(code, target):
-    project_root = osp.join(osp.dirname(__file__), "../..")
-    tl_template_path = osp.abspath(osp.join(project_root, "src"))
-
     hsaco = hipcc.compile_hip(
testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py (1)

434-436: Track ROCm precision issues.

Multiple TODO comments indicate precision problems on ROCm for specific test configurations. These disabled test cases reduce coverage for ROCm backend.

Would you like me to help create tracking issues for these ROCm precision problems to ensure they're addressed systematically?

Also applies to: 441-443, 466-468, 598-600, 631-633

src/tl_templates/hip/debug.h (1)

313-313: Inconsistent indentation: tab character instead of spaces.

Line 313 uses a tab character for indentation while the rest of the file uses spaces.

🔎 Proposed fix
-	PrintTraits<T>::print_buffer(msg, buf_name, index, var);
+  PrintTraits<T>::print_buffer(msg, buf_name, index, var);
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e64961f and 8e77897.

📒 Files selected for processing (52)
  • .github/workflows/ci.yml
  • src/op/gemm.cc
  • src/op/logical.cc
  • src/target/codegen_hip.cc
  • src/tl_templates/cuda/reduce.h
  • src/tl_templates/hip/atomic.h
  • src/tl_templates/hip/common.h
  • src/tl_templates/hip/debug.h
  • src/tl_templates/hip/hip_fp8.h
  • src/tl_templates/hip/reduce.h
  • testing/python/autotune/test_tilelang_autotune.py
  • testing/python/carver/test_tilelang_carver_cuda_driver_properties.py
  • testing/python/carver/test_tilelang_carver_recommend_hints.py
  • testing/python/components/test_storage_rewrite_detect_inplace.py
  • testing/python/debug/test_device_assert.py
  • testing/python/debug/test_tilelang_debug_print.py
  • testing/python/issue/test_tilelang_issue_1001.py
  • testing/python/issue/test_tilelang_issue_1008.py
  • testing/python/issue/test_tilelang_issue_830.py
  • testing/python/issue/test_tilelang_issue_96.py
  • testing/python/jit/test_tilelang_jit_callback.py
  • testing/python/jit/test_tilelang_jit_cutedsl.py
  • testing/python/jit/test_tilelang_jit_gemm.py
  • testing/python/jit/test_tilelang_jit_gemm_cython.py
  • testing/python/jit/test_tilelang_jit_nvrtc.py
  • testing/python/jit/test_tilelang_jit_parcompile.py
  • testing/python/jit/test_tilelang_jit_tvm_ffi.py
  • testing/python/kernel/test_tilelang_kernel_gemm.py
  • testing/python/kernel/test_tilelang_kernel_gemm_simt.py
  • testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py
  • testing/python/language/test_tilelang_language_alias.py
  • testing/python/language/test_tilelang_language_alloc.py
  • testing/python/language/test_tilelang_language_annot.py
  • testing/python/language/test_tilelang_language_annotate_safe_value.py
  • testing/python/language/test_tilelang_language_atomic_add.py
  • testing/python/language/test_tilelang_language_clear.py
  • testing/python/language/test_tilelang_language_composable_index.py
  • testing/python/language/test_tilelang_language_copy.py
  • testing/python/language/test_tilelang_language_frontend_v2.py
  • testing/python/language/test_tilelang_language_infinity.py
  • testing/python/language/test_tilelang_language_let.py
  • testing/python/language/test_tilelang_language_mask_op.py
  • testing/python/language/test_tilelang_language_ptr.py
  • testing/python/language/test_tilelang_language_unroll.py
  • testing/python/language/test_tilelang_language_var_init.py
  • testing/python/language/test_tilelang_language_vectorized_cast.py
  • testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py
  • testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py
  • tilelang/engine/lower.py
  • tilelang/intrinsics/mfma_layout.py
  • tilelang/intrinsics/mfma_macro_generator.py
  • tilelang/jit/adapter/wrapper.py
💤 Files with no reviewable changes (1)
  • testing/python/issue/test_tilelang_issue_830.py
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.

Applied to files:

  • testing/python/language/test_tilelang_language_let.py
  • testing/python/language/test_tilelang_language_mask_op.py
  • testing/python/language/test_tilelang_language_vectorized_cast.py
  • tilelang/engine/lower.py
  • testing/python/language/test_tilelang_language_copy.py
  • testing/python/language/test_tilelang_language_ptr.py
  • testing/python/jit/test_tilelang_jit_nvrtc.py
  • testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py
  • testing/python/issue/test_tilelang_issue_96.py
  • testing/python/autotune/test_tilelang_autotune.py
  • testing/python/issue/test_tilelang_issue_1001.py
  • testing/python/language/test_tilelang_language_annotate_safe_value.py
  • testing/python/language/test_tilelang_language_frontend_v2.py
  • testing/python/jit/test_tilelang_jit_tvm_ffi.py
  • testing/python/language/test_tilelang_language_alias.py
  • testing/python/kernel/test_tilelang_kernel_gemm_simt.py
  • testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py
  • testing/python/language/test_tilelang_language_annot.py
  • testing/python/jit/test_tilelang_jit_parcompile.py
  • testing/python/language/test_tilelang_language_alloc.py
  • testing/python/kernel/test_tilelang_kernel_gemm.py
  • testing/python/issue/test_tilelang_issue_1008.py
  • testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py
  • testing/python/language/test_tilelang_language_var_init.py
  • testing/python/jit/test_tilelang_jit_cutedsl.py
  • testing/python/language/test_tilelang_language_unroll.py
  • testing/python/language/test_tilelang_language_clear.py
  • testing/python/carver/test_tilelang_carver_cuda_driver_properties.py
  • testing/python/jit/test_tilelang_jit_gemm.py
  • testing/python/carver/test_tilelang_carver_recommend_hints.py
  • testing/python/jit/test_tilelang_jit_callback.py
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.

Applied to files:

  • testing/python/language/test_tilelang_language_mask_op.py
  • testing/python/language/test_tilelang_language_ptr.py
  • testing/python/jit/test_tilelang_jit_nvrtc.py
  • testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py
  • testing/python/components/test_storage_rewrite_detect_inplace.py
  • testing/python/issue/test_tilelang_issue_96.py
  • testing/python/language/test_tilelang_language_annotate_safe_value.py
  • testing/python/language/test_tilelang_language_frontend_v2.py
  • testing/python/language/test_tilelang_language_infinity.py
  • testing/python/language/test_tilelang_language_alias.py
  • testing/python/kernel/test_tilelang_kernel_gemm_simt.py
  • testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py
  • testing/python/jit/test_tilelang_jit_parcompile.py
  • testing/python/language/test_tilelang_language_atomic_add.py
  • testing/python/language/test_tilelang_language_alloc.py
  • testing/python/kernel/test_tilelang_kernel_gemm.py
  • testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py
  • testing/python/jit/test_tilelang_jit_cutedsl.py
  • testing/python/language/test_tilelang_language_unroll.py
  • testing/python/language/test_tilelang_language_clear.py
📚 Learning: 2025-12-24T17:20:32.819Z
Learnt from: clouds56
Repo: tile-ai/tilelang PR: 1527
File: tilelang/env.py:0-0
Timestamp: 2025-12-24T17:20:32.819Z
Learning: The nvidia-cuda-nvcc PyPI package installs to `nvidia/cu13/bin/` (for CUDA 13), `nvidia/cu12/bin/` (for CUDA 12), and `nvidia/cu11/bin/` (for CUDA 11) in the site-packages directory, not to `nvidia/cuda_nvcc/bin/`. These paths should be used when detecting CUDA installations from PyPI packages in tilelang/env.py.

Applied to files:

  • tilelang/engine/lower.py
📚 Learning: 2025-09-15T10:51:06.985Z
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). However, the layout template parameter should use sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h where Crosswise uses sizeof(ElementA) but the layout template uses sizeof_bits<ElementA>::value.

Applied to files:

  • tilelang/intrinsics/mfma_layout.py
📚 Learning: 2025-09-15T10:51:06.985Z
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h.

Applied to files:

  • tilelang/intrinsics/mfma_layout.py
📚 Learning: 2025-12-26T06:45:51.789Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1483
File: tilelang/jit/adapter/cutedsl/adapter.py:93-95
Timestamp: 2025-12-26T06:45:51.789Z
Learning: For the CuTeDSL backend in tilelang/jit/adapter/cutedsl/adapter.py, the host_kernel_source and device_kernel_source have the same value.

Applied to files:

  • testing/python/language/test_tilelang_language_copy.py
  • testing/python/language/test_tilelang_language_ptr.py
  • testing/python/jit/test_tilelang_jit_cutedsl.py
🧬 Code graph analysis (23)
src/op/gemm.cc (1)
src/target/utils.cc (2)
  • TargetIsCDNA (73-82)
  • TargetIsCDNA (73-73)
testing/python/debug/test_device_assert.py (1)
tilelang/language/ast/ir.py (1)
  • target (1677-1707)
testing/python/language/test_tilelang_language_mask_op.py (1)
tilelang/language/ast/ir.py (1)
  • target (1677-1707)
testing/python/language/test_tilelang_language_copy.py (1)
tilelang/language/ast/ir.py (1)
  • target (1677-1707)
testing/python/language/test_tilelang_language_ptr.py (1)
tilelang/language/ast/ir.py (1)
  • target (1677-1707)
testing/python/components/test_storage_rewrite_detect_inplace.py (3)
tilelang/language/ast/ir.py (1)
  • target (1677-1707)
tilelang/utils/target.py (1)
  • check_hip_availability (42-52)
src/target/codegen_hip.cc (2)
  • pattern (66-69)
  • pattern (66-67)
testing/python/issue/test_tilelang_issue_96.py (3)
tilelang/jit/__init__.py (2)
  • compile (48-108)
  • compile (353-379)
tilelang/jit/kernel.py (1)
  • out_idx (609-610)
tilelang/language/ast/ir.py (1)
  • target (1677-1707)
src/tl_templates/cuda/reduce.h (1)
src/tl_templates/hip/reduce.h (2)
  • run (121-181)
  • run (188-259)
src/tl_templates/hip/debug.h (4)
tilelang/language/print_op.py (1)
  • print_var (15-25)
src/tl_templates/cuda/gemm_sp_sm80.h (1)
  • unsigned (124-126)
tilelang/language/v2/dtypes.py (2)
  • short (235-235)
  • long (238-238)
src/tl_templates/cuda/reduce.h (2)
  • half_t (17-19)
  • bfloat16_t (20-22)
testing/python/jit/test_tilelang_jit_tvm_ffi.py (2)
tilelang/language/v2/dtypes.py (2)
  • float32 (300-300)
  • float16 (299-299)
tilelang/language/symbolics.py (1)
  • dynamic (10-21)
tilelang/intrinsics/mfma_macro_generator.py (1)
tilelang/language/v2/dtypes.py (1)
  • int8 (243-243)
testing/python/language/test_tilelang_language_alias.py (1)
tilelang/jit/__init__.py (2)
  • compile (48-108)
  • compile (353-379)
src/tl_templates/hip/common.h (1)
src/tl_templates/cuda/common.h (5)
  • tl (205-299)
  • tl (528-530)
  • tl (531-533)
  • tl (548-628)
  • bfloat16_t (538-539)
testing/python/jit/test_tilelang_jit_gemm_cython.py (2)
tilelang/language/v2/dtypes.py (2)
  • float32 (300-300)
  • float16 (299-299)
tilelang/language/symbolics.py (1)
  • dynamic (10-21)
testing/python/language/test_tilelang_language_atomic_add.py (2)
tilelang/language/v2/dtypes.py (2)
  • float16 (299-299)
  • float32 (300-300)
tilelang/language/atomic.py (1)
  • atomic_addx2 (226-261)
src/tl_templates/hip/reduce.h (2)
src/tl_templates/cuda/reduce.h (3)
  • tl (10-83)
  • run (110-171)
  • run (178-250)
src/tl_templates/hip/common.h (13)
  • tl (110-200)
  • shfl_down (138-140)
  • shfl_down (157-161)
  • shfl_down (182-186)
  • shfl_xor (134-136)
  • shfl_xor (151-155)
  • shfl_xor (176-180)
  • shfl (146-148)
  • shfl (169-173)
  • shfl (194-198)
  • shfl_up (142-144)
  • shfl_up (163-167)
  • shfl_up (188-192)
testing/python/debug/test_tilelang_debug_print.py (1)
tilelang/jit/__init__.py (2)
  • compile (48-108)
  • compile (353-379)
testing/python/language/test_tilelang_language_unroll.py (1)
tilelang/language/ast/ir.py (1)
  • target (1677-1707)
testing/python/language/test_tilelang_language_clear.py (2)
tilelang/jit/__init__.py (2)
  • compile (48-108)
  • compile (353-379)
tilelang/language/ast/ir.py (1)
  • target (1677-1707)
testing/python/language/test_tilelang_language_composable_index.py (1)
tilelang/language/ast/ir.py (1)
  • target (1677-1707)
testing/python/carver/test_tilelang_carver_cuda_driver_properties.py (1)
tilelang/carver/arch/driver/cuda_driver.py (7)
  • get_cuda_device_properties (22-30)
  • get_device_name (33-36)
  • get_shared_memory_per_block (39-52)
  • get_device_attribute (55-78)
  • get_persisting_l2_cache_max_size (97-99)
  • get_num_sms (102-118)
  • get_registers_per_block (121-129)
testing/python/jit/test_tilelang_jit_gemm.py (1)
tilelang/language/v2/dtypes.py (1)
  • float32 (300-300)
testing/python/jit/test_tilelang_jit_callback.py (2)
tilelang/engine/callback.py (2)
  • register_cuda_postproc_callback (29-59)
  • register_hip_postproc_callback (62-92)
tilelang/language/v2/dtypes.py (1)
  • float32 (300-300)
🔇 Additional comments (66)
testing/python/issue/test_tilelang_issue_1001.py (1)

26-26: LGTM! Correct CUDA gating for device-specific test.

The decorator appropriately gates this test behind CUDA availability, which is necessary given the hardcoded device="cuda" on line 29.

testing/python/language/test_tilelang_language_infinity.py (2)

3-3: Verify that tilelang.testing module provides the required decorator and entry point.

The changes introduce dependency on tilelang.testing.requires_cuda decorator and tilelang.testing.main() entry point. Per the PR objectives, these are part of the new test infrastructure updates for CUDA requirement decorators. However, the implementation and API of this module is not visible in the provided context.

Confirm the following:

  1. The tilelang.testing.requires_cuda decorator properly skips tests when CUDA is unavailable
  2. The tilelang.testing.main() entry point is compatible with standard test runners (pytest, unittest)
  3. Test discovery and CI/CD integration still work as expected

Additionally, given that this PR introduces HIP/ROCM support ("rocm_ci"), consider whether the test should use a decorator like requires_cuda_or_hip or run on both backends.

Also applies to: 24-24, 33-33


17-29: Verify infinity comparison logic across all dtypes, especially float16 and bfloat16.

The test compares kernel output directly with torch.inf across four dtypes. While this should work for standard IEEE 754 floats, float16 and bfloat16 have reduced precision and may exhibit different behavior.

Confirm:

  1. How does torch.inf == torch.inf behave for float16 and bfloat16 tensors?
  2. Are there any dtype promotion issues when comparing against the float32-typed torch.inf constant?
  3. Has this test been validated on actual CUDA devices with all four dtypes?

Additionally, the kernel is called without explicit device selection (line 19). Ensure the implicit device context is correctly set by the @requires_cuda decorator.

testing/python/issue/test_tilelang_issue_1008.py (1)

42-42: LGTM! Decorators correctly gate CUDA-specific tests.

The @tilelang.testing.requires_cuda decorators are appropriate since both tests explicitly use device="cuda". This ensures tests are skipped on systems without CUDA support, aligning with the multi-backend test-gating pattern in this PR.

Also applies to: 49-49

testing/python/jit/test_tilelang_jit_gemm_cython.py (3)

160-174: LGTM! Using float32 for accumulation improves numerical stability.

Accumulating in float32 while keeping float16 for input/output is a standard practice that prevents precision loss and potential overflow during GEMM operations.


210-211: Consistent precision improvement applied to benchmarking and multi-stream tests.

The float32 accumulator change is correctly applied to both the benchmarking test and multi-stream test functions.

Also applies to: 254-255


303-308: Float32 accumulation consistently applied to all dynamic shape test variants.

All dynamic shape tests (with varying combinations of dynamic M, N, K dimensions) now use float32 for accumulation, maintaining consistency with the other test functions.

Also applies to: 356-357

testing/python/jit/test_tilelang_jit_cutedsl.py (1)

100-100: LGTM! CUDA-specific tests correctly gated.

The @tilelang.testing.requires_cuda decorators appropriately gate these test functions, as they all rely on CUDA-specific PyTorch APIs (.cuda() for tensor placement and torch.cuda.Stream() for stream management). This aligns with the broader multi-backend infrastructure work.

Also applies to: 221-221, 273-273, 316-316, 366-366

testing/python/language/test_tilelang_language_vectorized_cast.py (1)

78-78: LGTM! Appropriate CUDA requirement gate.

The @tilelang.testing.requires_cuda decorator is correctly added to gate this test, which creates CUDA tensors and executes CUDA kernels. This is consistent with the other test functions in this file (lines 100, 115) and aligns with the PR's objective of adding multi-backend support.

.github/workflows/ci.yml (1)

406-407: Temporary test exclusion is acceptable for WIP.

The ignore flags for ./python/runtime and ./python/transform are appropriate for a work-in-progress ROCm implementation. However, ensure these exclusions are addressed and removed before the PR is marked ready for final merge, as they create a test coverage gap compared to the CUDA tests.

testing/python/kernel/test_tilelang_kernel_gemm_simt.py (3)

157-157: Correct dtype alignment for reference computation.

The reference result now correctly uses out_dtype instead of accum_dtype, ensuring the reference matches the actual output dtype of the kernel. This aligns with the pattern used in similar test files (test_tilelang_kernel_gemm_mma.py and test_tilelang_kernel_gemm_int4_gemm_mma.py).


164-164: Verify the intent of accumulation dtype change.

The accumulation dtype for the first test case changed from T.float16 to T.float32, which increases precision. Please confirm whether this change is:

  • A correction to fix a previously incorrect test configuration
  • An intentional modification to test higher-precision accumulation
  • Related to numerical stability or accuracy concerns

168-170: LGTM! CUDA-specific int8 test is properly gated.

The new test_assert_tl_matmul_int8 function is correctly decorated with @tilelang.testing.requires_cuda since int8 matmul uses the dp4a instruction (line 119 in tl_matmul_simt), which is CUDA-specific. This aligns with the PR's objective to support multi-backend execution with appropriate test gating.

testing/python/language/test_tilelang_language_var_init.py (1)

5-6: LGTM! Appropriate gating for unsupported HIP functionality.

The decorator correctly gates this test to CUDA-only execution, preventing failures on HIP/ROCM where variable initialization is not yet supported. The TODO comment clearly documents this limitation for future work.

testing/python/jit/test_tilelang_jit_gemm.py (1)

106-120: LGTM! Improved numerical stability with FP32 accumulation.

The function rename accurately reflects the accumulator dtype change, and switching from FP16 to FP32 accumulation significantly improves numerical stability in GEMM operations. Accumulating in higher precision (FP32) while keeping FP16 inputs and output is a best practice that reduces overflow/underflow risks during matrix multiplication.

This change is consistent with the broader pattern across GEMM/JIT tests in this PR.

testing/python/jit/test_tilelang_jit_tvm_ffi.py (1)

167-167: LGTM! Improved numerical stability.

Changing the accumulation dtype from T.float16 to T.float32 improves numerical stability for GEMM operations while maintaining the memory and bandwidth benefits of float16 for inputs and outputs. This is a standard best practice for mixed-precision computation, especially important for multi-backend support where consistent numerical behavior is desired.

Also applies to: 210-210, 252-252, 301-301, 303-303, 306-306

testing/python/language/test_tilelang_language_atomic_add.py (2)

198-207: LGTM: dtype parameterization added correctly.

The dtype parameterization for atomic_addx2_program is well-structured and consistent with other atomic operation functions in this file.


238-238: LGTM: CUDA requirement decorators added appropriately.

The @tilelang.testing.requires_cuda decorators correctly gate CUDA-specific atomic operations, ensuring tests only run on appropriate hardware. This aligns with the PR's multi-backend support objectives.

Also applies to: 243-243, 353-353

testing/python/jit/test_tilelang_jit_parcompile.py (1)

61-62: Verify the block_K configuration values.

The test configurations at lines 61-62 use different block_K values (64 and 32). Confirm whether these values are intentional for testing different blocking strategies or if they require explanation. If this change is part of ROCm optimization, consider adding a comment to document the rationale.

testing/python/components/test_storage_rewrite_detect_inplace.py (2)

4-6: LGTM! Appropriate use of import-time configuration.

The module-level HIP availability check is a reasonable approach for test configuration. The result is cached and reused throughout the test, avoiding repeated checks.


60-62: Backend-specific code generation is expected behavior; consider adding a documentation comment.

The test correctly validates that the inplace detection optimization works on both HIP and CUDA backends, but they generate different code patterns:

  • HIP: read[0] = (read[0] * 2); (array-style indexing)
  • CUDA: read = (read * 2); (scalar access)

This difference is intentional and stems from how each backend's code generator handles intermediate variables. However, adding a comment explaining why HIP uses array-style indexing while CUDA uses scalar access would improve code clarity for future maintainers.

testing/python/jit/test_tilelang_jit_callback.py (3)

4-4: LGTM! Import addition supports HIP backend.

The import of register_hip_postproc_callback mirrors the existing CUDA callback import and enables HIP post-processing support, aligning with the PR's multi-backend objectives.


92-96: LGTM! HIP callback correctly mirrors CUDA implementation.

The HIP post-processing callback properly mirrors the CUDA callback structure and will register for the HIP backend. The minor code duplication between the two callbacks is acceptable for test code.


232-232: No action required. The accumulation dtype T.float32 is consistent with the standard GEMM implementation pattern across the codebase. Nearly all other GEMM tests (CPU, WebGPU, Metal, Transform, Profiler, Language tests, etc.) use T.float32 for accumulation by default. This is not a HIP-specific requirement but rather the standard approach for numerical precision—the reference computation in the test itself explicitly performs accumulation in float32 before converting to the output dtype. The change is correct and requires no further adjustment.

tilelang/intrinsics/mfma_layout.py (2)

3-3: LGTM: Import addition is correct.

The addition of tvm.tir.const is appropriate for creating TIR constant expressions in the modified functions.


9-9: Use const(0) for TIR scalar constants in layout functions.

The change from convert(0) to const(0) is correct. In TIR context, const() is the proper API for creating scalar constant expressions, while convert() is a runtime conversion utility. These layout functions are used in TIR operations where their return values are unpacked and applied directly to array indices, making const(0) semantically appropriate. The existing test suite (test_tilelang_gemm_mfma_intrinsic.py) exercises these functions through MFMA matrix operations and will catch any regressions.

src/tl_templates/hip/hip_fp8.h (1)

1-1: LGTM: Standard include guard added.

Adding #pragma once is a best practice to prevent multiple inclusions of this header file.

src/op/gemm.cc (1)

157-159: LGTM: CDNA warp partitioning correctly configured.

The CDNA target is properly handled by setting kNPerWarp = 16, consistent with Volta architecture. This aligns with AMD CDNA (gfx9xx) warp characteristics and ensures correct matrix partitioning for MFMA instructions.

testing/python/issue/test_tilelang_issue_96.py (1)

40-44: Verify device compatibility with "auto" target.

The compilation target is set to "auto", but lines 43-44 explicitly create tensors on the "cuda" device. If the "auto" target selects a non-CUDA backend or CUDA is unavailable, this test will fail at runtime. Consider adding @tilelang.testing.requires_cuda decorator to the test functions (lines 52 and 57) to ensure CUDA availability, or update the tensor creation to use a device that matches the auto-selected target.

testing/python/language/test_tilelang_language_composable_index.py (1)

33-39: Verify device compatibility with "auto" target.

The compilation target has been changed to "auto", but line 39 still creates tensors with device="cuda". This creates a potential mismatch if the "auto" target selects a non-CUDA backend. Consider either:

  1. Adding @tilelang.testing.requires_cuda to test_tilelang_copy() (line 44) to ensure CUDA is available, or
  2. Making the device selection dynamic based on the compiled target.
testing/python/language/test_tilelang_language_let.py (1)

6-6: LGTM: Appropriate CUDA requirement decorator.

The @tilelang.testing.requires_cuda decorator correctly gates this test, which explicitly compiles for the "cuda" target (line 18) and verifies CUDA-specific vectorization codegen (line 19).

testing/python/language/test_tilelang_language_ptr.py (1)

44-52: Verify device compatibility with "auto" target.

The compilation target has been changed to "auto", but the test execution in run_matmul() creates all tensors with device="cuda" (lines 49-50, 52). This creates a potential runtime mismatch if the "auto" target selects a non-CUDA backend. Consider adding @tilelang.testing.requires_cuda decorator to test_matmul() at line 61 to ensure CUDA availability before running this test.

testing/python/carver/test_tilelang_carver_recommend_hints.py (1)

136-138: Consider gating other template tests for consistency.

Only test_fmha_recommend_hints() has the @tilelang.testing.requires_cuda decorator, while other template tests (test_general_reduction_recommend_hints, test_elementwise_recommend_hints, test_matmul_recommend_hints, test_gemv_recommend_hints) do not. If Flash Attention has specific CUDA requirements that other templates don't have, this is correct. Otherwise, consider whether the other tests should also be gated for consistency, especially since they all use auto_infer_current_arch() which may assume GPU availability.

testing/python/language/test_tilelang_language_alias.py (1)

48-50: No changes needed. The "auto" target in tilelang is specifically designed to automatically detect available devices (CUDA, HIP, or Metal) at runtime through the determine_target() function in tilelang/utils/target.py. The kernel compilation and execution will use whichever device is available on the system, making explicit @requires_cuda decorators unnecessary. This aligns with the codebase pattern where all language tests using target="auto" (test_tilelang_language_clear.py, test_tilelang_language_mask_op.py, test_tilelang_language_unroll.py, test_tilelang_language_copy.py, etc.) intentionally omit CUDA requirement decorators.

Likely an incorrect or invalid review comment.

testing/python/language/test_tilelang_language_annotate_safe_value.py (1)

45-47: LGTM - Appropriate CUDA gating.

The @requires_cuda decorator correctly gates this test to CUDA-enabled environments, consistent with the device-specific operations in the test.

testing/python/carver/test_tilelang_carver_cuda_driver_properties.py (1)

29-33: LGTM - Appropriate CUDA requirement guards.

All test functions now correctly require CUDA availability since they specifically test CUDA driver properties via torch.cuda and CUDA device attributes. The decorators ensure tests only run in appropriate environments.

Also applies to: 36-40, 43-47, 50-54, 57-61, 64-68, 71-75

testing/python/kernel/test_tilelang_kernel_gemm.py (1)

106-107: LGTM - Selective CUDA gating for specific GEMM configurations.

The selective application of @requires_cuda decorators indicates intentional differentiation between CUDA-specific GEMM configurations and those that can run across multiple backends. This pattern appropriately gates tests that rely on CUDA-specific optimizations or behavior.

Also applies to: 172-173, 190-191, 216-217, 255-256, 273-274

testing/python/language/test_tilelang_language_unroll.py (1)

6-33: LGTM - Appropriate differentiation for backend-specific unroll support.

The changes correctly handle backend-specific unroll pragma support:

  • test_unroll_with_step: Uses target="auto" without decorator, allowing execution on both CUDA and HIP (basic #pragma unroll is widely supported)
  • test_unroll_with_unroll_factor: Uses target="auto" with @requires_cuda decorator, gating execution to CUDA-only since HIP doesn't support parameterized unroll factors

This pattern demonstrates the correct approach for handling backend-specific feature availability while still enabling automatic target selection where possible.

src/op/logical.cc (1)

40-54: LGTM - Correct HIP backend support for logical operators.

The HIP intrinsic lowering registrations correctly extend multi-backend support for the tl.any_of and tl.all_of operators:

  • Both operators now support hip.FLowerIntrinsic alongside existing cuda.FLowerIntrinsic
  • The same lowering functions (any_of_op, all_of_op) are appropriately shared between backends
  • The lowering generates backend-agnostic external calls (tl::Any, tl::All)

This pattern aligns with the broader PR objective of introducing HIP/ROCm support.

testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py (1)

199-199: LGTM! Appropriate CUDA gating for tensor core tests.

The @tilelang.testing.requires_cuda decorators correctly gate tests that use CUDA-specific MMA tensor core operations.

Also applies to: 403-403

testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py (1)

156-156: LGTM! Appropriate CUDA gating for sparse GEMM tests.

The @tilelang.testing.requires_cuda decorators correctly gate tests for CUDA-specific sparse tensor core operations.

Also applies to: 308-308, 459-459, 614-614

tilelang/engine/lower.py (1)

15-15: LGTM! Refactoring to centralized path management.

The changes correctly replace environment-variable based path resolution with centralized constants from tilelang.env, improving maintainability and consistency across CUDA and HIP compilation paths.

Also applies to: 76-77, 126-127

tilelang/jit/adapter/wrapper.py (2)

314-316: LGTM! Backend-specific declaration parsing.

The new get_declaration methods appropriately handle the different declaration styles between CUDA (which has forward declarations ending with ;) and HIP (which doesn't, so splits on {). This refactoring improves code organization and backend compatibility.

Also applies to: 612-615


585-586: LGTM! ROCm FP8 type mapping additions.

The type mappings for float8_e5m2fnuz and uint64 correctly support ROCm's FP8 types and standard unsigned 64-bit integers.

Also applies to: 592-592

testing/python/jit/test_tilelang_jit_nvrtc.py (1)

100-100: LGTM! Appropriate CUDA gating for NVRTC tests.

All tests using the NVRTC execution backend correctly require CUDA availability, as NVRTC is CUDA-specific.

Also applies to: 221-221, 273-273, 316-316, 366-366, 438-438, 452-452

src/tl_templates/hip/common.h (3)

113-130: LGTM! Generic helper utilities.

The Any and All helper functions correctly implement short-circuit evaluation for boolean array checks.
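
For readers unfamiliar with the pattern, a minimal sketch of short-circuiting Any/All helpers over a fixed-size boolean array is shown below; the signatures and qualifiers are assumptions, not the exact code in common.h.

```cpp
#include <hip/hip_runtime.h>

// Hedged sketch: short-circuit Any/All over a fixed-size boolean array.
// The real tl::Any / tl::All in common.h may use different signatures.
template <int N>
__device__ __forceinline__ bool Any(const bool (&flags)[N]) {
  for (int i = 0; i < N; ++i)
    if (flags[i]) return true;   // stop at the first true element
  return false;
}

template <int N>
__device__ __forceinline__ bool All(const bool (&flags)[N]) {
  for (int i = 0; i < N; ++i)
    if (!flags[i]) return false; // stop at the first false element
  return true;
}
```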


151-198: LGTM! Proper type casting for half and bfloat16 shuffle operations.

The specializations for half_t and bfloat16_t correctly cast to float for shuffle operations and cast back, which is necessary since hardware shuffle instructions typically operate on 32-bit values.
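
As a rough illustration of that cast-shuffle-cast idea (using the standard __half type as a stand-in for tilelang's half_t; the wrapper name and signature are assumptions):

```cpp
#include <hip/hip_runtime.h>
#include <hip/hip_fp16.h>

// Sketch only: shuffle a 16-bit value by widening to float, since the HIP
// shuffle intrinsics operate on 32-bit lanes. tilelang's tl::shfl_xor for
// half_t / bfloat16_t follows the same shape but is not reproduced verbatim.
__device__ __forceinline__ __half shfl_xor_half(__half v, int lane_mask) {
  float widened = __half2float(v);          // half -> float
  widened = __shfl_xor(widened, lane_mask); // 32-bit wavefront shuffle
  return __float2half(widened);             // float -> half
}
```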


110-200: Address HIP shuffle function synchronization gap.

The HIP implementation uses non-sync warp shuffle functions (__shfl_xor, __shfl_down, __shfl_up, __shfl) without mask parameters, while the CUDA counterparts use explicit _sync variants with mask support (e.g., shfl_xor_sync(uint32_t(-1), ...)). The TODO comment acknowledges ROCm 7.1.1 provides shfl_sync support, but the implementation has not yet been updated.

Verify that:

  1. HIP's non-sync shuffle intrinsics provide implicit or sufficient synchronization guarantees for warp-level operations in your use cases
  2. The current approach in src/tl_templates/hip/reduce.h (which calls tl::shfl_* without masks) is safe and equivalent to the CUDA masked approach, or update to use _sync variants once ROCm support is confirmed
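
For context, the two flavors being compared look roughly like this for a plain float; this is a comparison sketch, not tilelang's actual wrapper.

```cpp
#include <hip/hip_runtime.h>

// HIP today: non-sync intrinsic, no mask argument, operates wavefront-wide.
__device__ float xor_shuffle_hip(float v, int lane_mask) {
  return __shfl_xor(v, lane_mask);
}

// CUDA counterpart (shown for comparison only; not compiled under HIP):
//   __device__ float xor_shuffle_cuda(float v, int lane_mask) {
//     return __shfl_xor_sync(0xffffffffu, v, lane_mask); // explicit full-warp mask
//   }
```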
testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py (1)

114-119: Verify accumulator dtype change from float16 to float32.

Multiple test cases have been updated to use T.float32 for dtypeAccum instead of T.float16. While this change may improve numerical accuracy, it could mask precision issues that would occur with lower precision accumulation.

Ensure this change aligns with your testing strategy and doesn't inadvertently reduce test coverage for float16 accumulation paths.

Also applies to: 272-276, 429-433, 593-596

src/tl_templates/hip/atomic.h (2)

7-20: LGTM! Appropriate overloads for pointer and reference operands.

The template overloads correctly handle both pointer (T1*) and reference (T1&) first arguments, providing flexibility for different addressing patterns during code generation.


64-104: LGTM! Vectorized atomic operations for multi-component types.

The vectorized variants (AtomicAddx2, AtomicAddx4) correctly decompose float2 and float4 operations into individual component-wise atomics, with appropriate return value variants.
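
A minimal sketch of the component-wise decomposition described above; the names and the exact return-value variants in atomic.h are assumptions.

```cpp
#include <hip/hip_runtime.h>

// Sketch: a float2 atomic add decomposed into two scalar float atomics, plus a
// variant that returns the previous values, mirroring the pattern the review
// describes for AtomicAddx2 / AtomicAddx4.
__device__ __forceinline__ void atomic_add_x2(float2 *dst, float2 val) {
  atomicAdd(&dst->x, val.x);
  atomicAdd(&dst->y, val.y);
}

__device__ __forceinline__ float2 atomic_add_x2_ret(float2 *dst, float2 val) {
  float2 old;
  old.x = atomicAdd(&dst->x, val.x); // atomicAdd returns the prior value
  old.y = atomicAdd(&dst->y, val.y);
  return old;
}
```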

src/tl_templates/hip/debug.h (1)

215-230: LGTM: Trait-based design is a good improvement.

The PrintTraits primary template with fallback behavior and macro-generated specializations provides clean extensibility and reduces code duplication.
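
For illustration, a trait-plus-macro layout of this kind looks roughly as follows; the member names, macro arguments, and fallback behavior here are assumptions rather than the exact contents of debug.h.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Primary template: fallback for types without a registered printer.
template <typename T> struct PrintTraits {
  static __device__ void print_value(const char *msg, T /*v*/) {
    printf("%s: <no printer registered for this dtype>\n", msg);
  }
};

// Macro stamps out one specialization per supported type: a display name,
// a printf format string, and the type the value is cast to before printing.
#define DEFINE_PRINT_TRAIT(T, NAME, FMT, CAST)                                 \
  template <> struct PrintTraits<T> {                                          \
    static __device__ void print_value(const char *msg, T v) {                 \
      printf("%s (" NAME "): " FMT "\n", msg, (CAST)v);                        \
    }                                                                          \
  }

DEFINE_PRINT_TRAIT(int, "int", "%d", int);
DEFINE_PRINT_TRAIT(unsigned long long, "unsigned long long", "%llu", unsigned long long);
```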

testing/python/debug/test_device_assert.py (1)

16-16: LGTM: Target changed to "auto" for automatic backend selection.

The change from target="cuda" to target="auto" enables this test to run on both CUDA and ROCm backends, aligning with the PR's multi-backend support goal.

Also applies to: 28-28

testing/python/language/test_tilelang_language_clear.py (1)

44-51: Inconsistency: target="auto" but tensors use .cuda() explicitly.

The compilation target is set to "auto" for backend-agnostic selection, but tensor creation at lines 48-49 uses .cuda() which is CUDA-specific. On ROCm, this may still work (PyTorch ROCm maps .cuda() to HIP), but for clarity and consistency, consider using a device-agnostic approach.

Verify that .cuda() works correctly on ROCm environments, or consider using a utility function that abstracts the device selection.

testing/python/language/test_tilelang_language_alloc.py (2)

36-36: LGTM: Assertion relaxation accommodates code generation variations.

The updated assertions accept both "tmp =" and "tmp[0] =" patterns, accommodating differences in how variables are generated across backends.

Also applies to: 76-76


118-121: LGTM: Appropriate CUDA gating for unsupported ROCm features.

The @tilelang.testing.requires_cuda decorator correctly gates tests that depend on alloc_var with initializer, which is noted as unsupported on ROCm. The TODO comments provide useful context.

Also applies to: 159-162

testing/python/autotune/test_tilelang_autotune.py (1)

263-266: LGTM: CUDA requirement decorators appropriately gate autotuner tests.

The autotuner tests rely on CUDA-specific carver infrastructure (e.g., CUDA("cuda") at line 52). The @tilelang.testing.requires_cuda decorators correctly prevent these tests from running on non-CUDA environments.

Also applies to: 269-272

testing/python/language/test_tilelang_language_copy.py (1)

32-39: LGTM: Target changed to "auto" for multi-backend support.

The compilation targets are updated to "auto" consistently across the copy test functions. Note that tensor creation still uses device="cuda" - this works on ROCm via PyTorch's HIP mapping, but consider abstracting device selection for better clarity if expanding to more backends.

Also applies to: 69-79, 134-139

testing/python/language/test_tilelang_language_frontend_v2.py (1)

339-341: LGTM: Consistent CUDA gating for alloc_var with initializer.

The test_while_loop function uses T.alloc_var(T.int32, 0) with an initializer value, which is noted as unsupported on ROCm. The decorator appropriately gates this test.

src/target/codegen_hip.cc (2)

945-947: LGTM: FP8 E5M2 dtype mappings added.

The new mappings for float8_e5m2fnuzx4 and float8_e5m2fnuzx8 follow the existing pattern established for float8_e4m3fnuz variants. Using "long" for the 8-element vector is consistent with the e4m3 variant.


985-995: Warp reduction implementations are correctly in place.

All five warp reduction functions (warp_reduce_sum, warp_reduce_max, warp_reduce_min, warp_reduce_bitand, warp_reduce_bitor) are implemented in src/tl_templates/hip/reduce.h at lines 273, 277, 281, 285, and 289 respectively. The handlers in codegen_hip.cc correctly delegate to these tl:: namespace functions.
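
A sketch of how such thin wrappers typically delegate to a generic butterfly reduction follows; the generic warp_reduce mirrors the 64-lane version shown later in this review, while the wrapper bodies are assumptions and are written for plain 32-bit arithmetic types (16-bit types would go through the tl::shfl wrappers discussed elsewhere).

```cpp
#include <hip/hip_runtime.h>

// Generic butterfly reduction across a 64-lane wavefront.
template <typename T, typename ReduceOp>
__device__ T warp_reduce(T value, ReduceOp op) {
  for (int offset = 32; offset > 0; offset >>= 1)
    value = op(value, __shfl_xor(value, offset));
  return value;
}

// Thin convenience wrappers that codegen can call by name.
template <typename T> __device__ T warp_reduce_sum(T value) {
  return warp_reduce(value, [](T a, T b) { return a + b; });
}

template <typename T> __device__ T warp_reduce_max(T value) {
  return warp_reduce(value, [](T a, T b) { return a > b ? a : b; });
}
```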

src/tl_templates/cuda/reduce.h (1)

177-186: LGTM!

The changes correctly parameterize thread indexing via SEG (defaulting to 32 for CUDA's warp size) and align the return type with the HIP counterpart. The logic for lane = tid % SEG and row = tid / SEG is consistent with CumSum1D in the same file.
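
The indexing itself is just a divide/modulo split of the flat thread id, along the lines of this sketch (SEG would be 32 for CUDA warps and 64 for HIP wavefronts; the names are illustrative, not the actual CumSum code):

```cpp
#include <hip/hip_runtime.h>

// Illustrative only: map a flat thread id to (row, lane) within SEG-wide segments.
template <int SEG = 32>
__device__ __forceinline__ void segment_coords(int tid, int &row, int &lane) {
  lane = tid % SEG; // position inside one warp/wavefront segment
  row  = tid / SEG; // which segment (row of the tile) this thread works on
}
```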

testing/python/debug/test_tilelang_debug_print.py (1)

20-34: Good refactoring with pytest parameterization.

Clean separation of platform-agnostic dtype tests via @pytest.mark.parametrize and platform-specific FP8 tests using @tilelang.testing.requires_cuda / @tilelang.testing.requires_rocm decorators.

src/tl_templates/hip/reduce.h (3)

117-182: LGTM!

The CumSum1D implementation correctly uses tl::shfl_* wrappers for type-safe shuffle operations and appropriately defaults SEG=64 for HIP's 64-wide wavefronts. The static_assert correctly excludes threads==32 which would be insufficient for the HIP architecture.


184-260: LGTM!

The CumSum2D implementation is well-structured with correct use of tl::shfl_* wrappers and appropriate SEG=64 default for HIP. The axis and reverse handling logic is consistent with the CUDA version in src/tl_templates/cuda/reduce.h.


75-78: Good refactoring to use tl:: wrappers.

The updates to use tl::shfl_down and tl::shfl_xor in SharedReduceWarp and AllReduce ensure proper handling of half_t and bfloat16_t types through the specializations in common.h.

Also applies to: 106-108

./python/jit/test_tilelang_jit_cutedsl.py

# AMD ROCm tests
# runtime and transform tests needs to repair, then rm it from ignore list
Contributor


⚠️ Potential issue | 🟡 Minor

Fix grammar and add tracking reference.

The comment has a grammatical error: "needs to repair" should be "need to be repaired" or "need repair". Additionally, consider adding a TODO marker or issue reference to track when these directories can be un-ignored.

🔎 Suggested fix
-      # runtime and transform tests needs to repair, then rm it from ignore list
+      # TODO(#XXXX): runtime and transform tests need to be repaired, then remove from ignore list

Replace #XXXX with an actual issue number. Would you like me to help create a tracking issue for this technical debt?

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# runtime and transform tests needs to repair, then rm it from ignore list
# TODO(#XXXX): runtime and transform tests need to be repaired, then remove from ignore list
🤖 Prompt for AI Agents
In .github/workflows/ci.yml around line 395, the inline comment "runtime and
transform tests needs to repair, then rm it from ignore list" is grammatically
incorrect and lacks tracking: change it to a clear TODO such as "# TODO: runtime
and transform tests need to be repaired; remove from ignore list — see
ISSUE-XXXX" (or "# TODO: runtime and transform tests need repair; remove from
ignore list — see #1234"), and replace ISSUE-XXXX/#1234 with an actual issue
number or create one and reference it.

Comment on lines +251 to +260
DEFINE_PRINT_TRAIT(char, "char", "%d", int);
DEFINE_PRINT_TRAIT(signed char, "signed char", "%d", int);
DEFINE_PRINT_TRAIT(unsigned char, "unsigned char", "%u", unsigned int);
DEFINE_PRINT_TRAIT(short, "short", "%d", int);
DEFINE_PRINT_TRAIT(unsigned short, "unsigned short", "%u", unsigned int);
DEFINE_PRINT_TRAIT(int, "int", "%d", int);
DEFINE_PRINT_TRAIT(unsigned int, "uint", "%u", unsigned int);
DEFINE_PRINT_TRAIT(long, "long", "%ld", long);
DEFINE_PRINT_TRAIT(unsigned long, "ulong", "%lu", unsigned long);
DEFINE_PRINT_TRAIT(long long, "long long", "%lld", long long);
Contributor


⚠️ Potential issue | 🟡 Minor

Missing unsigned long long specialization.

The trait defines long long but not unsigned long long. This could cause the fallback template to be used for uint64_t values, printing them as pointers instead of their actual values.

🔎 Proposed fix
 DEFINE_PRINT_TRAIT(long long, "long long", "%lld", long long);
+DEFINE_PRINT_TRAIT(unsigned long long, "unsigned long long", "%llu", unsigned long long);
🤖 Prompt for AI Agents
In src/tl_templates/hip/debug.h around lines 251 to 260, the print-trait list
defines long long but omits unsigned long long, which can cause uint64_t to fall
back to the pointer-printing trait; add a DEFINE_PRINT_TRAIT entry for unsigned
long long using the name "unsigned long long", the "%llu" format specifier, and
unsigned long long as the cast/print type (mirror the pattern used for other
unsigned types).

Comment on lines +262 to +271
template <typename T, typename ReduceOp>
TL_DEVICE T warp_reduce(T value, ReduceOp op) {
value = op(value, __shfl_xor(value, 32));
value = op(value, __shfl_xor(value, 16));
value = op(value, __shfl_xor(value, 8));
value = op(value, __shfl_xor(value, 4));
value = op(value, __shfl_xor(value, 2));
value = op(value, __shfl_xor(value, 1));
return value;
}

⚠️ Potential issue | 🟠 Major

Use tl::shfl_xor instead of raw __shfl_xor for type safety.

The warp_reduce function uses raw HIP intrinsics (__shfl_xor) which will produce incorrect results or fail for half_t and bfloat16_t types. These types require the tl::shfl_xor wrapper (defined in common.h lines 148-177) that converts to float before shuffling.

This is inconsistent with the rest of the file where tl::shfl_* wrappers are correctly used (e.g., lines 76, 107, 142, 165, etc.).

🔎 Proposed fix
 template <typename T, typename ReduceOp>
 TL_DEVICE T warp_reduce(T value, ReduceOp op) {
-  value = op(value, __shfl_xor(value, 32));
-  value = op(value, __shfl_xor(value, 16));
-  value = op(value, __shfl_xor(value, 8));
-  value = op(value, __shfl_xor(value, 4));
-  value = op(value, __shfl_xor(value, 2));
-  value = op(value, __shfl_xor(value, 1));
+  value = op(value, tl::shfl_xor(value, 32));
+  value = op(value, tl::shfl_xor(value, 16));
+  value = op(value, tl::shfl_xor(value, 8));
+  value = op(value, tl::shfl_xor(value, 4));
+  value = op(value, tl::shfl_xor(value, 2));
+  value = op(value, tl::shfl_xor(value, 1));
   return value;
 }
🤖 Prompt for AI Agents
In src/tl_templates/hip/reduce.h around lines 262 to 271, the warp_reduce
implementation uses raw HIP intrinsic __shfl_xor which breaks for
half_t/bfloat16_t; replace each __shfl_xor call with the type-safe wrapper
tl::shfl_xor (e.g., tl::shfl_xor(value, 32), etc.) so the wrapper in common.h
handles conversion to float for those types; keep the same lane masks and op
logic unchanged.

Comment on lines +248 to +254
@tilelang.testing.requires_cuda
def test_atomic_addx2_half():
run_atomic_addx2(32, 64, 8, 16, dtype=T.float16)


def test_atomic_addx2_float():
run_atomic_addx2(32, 64, 8, 16, dtype=T.float32)

⚠️ Potential issue | 🟠 Major

Add missing @requires_cuda decorator for consistency.

test_atomic_addx2_half (Line 248) is decorated with @tilelang.testing.requires_cuda, but test_atomic_addx2_float (Line 253) is not. Both tests use the same underlying atomic_addx2 operation, which appears to be CUDA-specific. This inconsistency may cause test_atomic_addx2_float to fail on non-CUDA platforms.

🔎 Proposed fix
+@tilelang.testing.requires_cuda
 def test_atomic_addx2_float():
     run_atomic_addx2(32, 64, 8, 16, dtype=T.float32)
🤖 Prompt for AI Agents
In testing/python/language/test_tilelang_language_atomic_add.py around lines 248
to 254, the test function test_atomic_addx2_float is missing the
@tilelang.testing.requires_cuda decorator (test_atomic_addx2_half has it),
causing inconsistency and possible failures on non-CUDA hosts; add the
@tilelang.testing.requires_cuda decorator immediately above the def
test_atomic_addx2_float() line so both CUDA-specific tests are consistently
skipped when CUDA is unavailable.

Comment on lines 133 to 156
@pytest.mark.skip(reason="Temporarily disabling until GEMM SS issues are resolved")
@tilelang.testing.requires_cuda
@pytest.mark.parametrize(
"M, N, K, trans_A, trans_B, in_dtype, out_dtype, dtypeAccum, block_M, block_N, block_K, num_stages, num_threads",
[
(128, 128, 128, True, True, T.float8_e5m2, T.float8_e5m2, T.float32, 128, 128, 32, 2, 128),
(128, 128, 128, True, True, T.float8_e4m3, T.float8_e4m3, T.float32, 128, 128, 32, 2, 128),
],
)
def test_gemm_ss_fp8_cuda(M, N, K, trans_A, trans_B, in_dtype, out_dtype, dtypeAccum, block_M, block_N, block_K, num_stages, num_threads):
run_gemm_ss(M, N, K, trans_A, trans_B, in_dtype, out_dtype, dtypeAccum, block_M, block_N, block_K, num_stages, num_threads)


@pytest.mark.skip(reason="Temporarily disabling until GEMM SS issues are resolved")
@tilelang.testing.requires_rocm
@pytest.mark.parametrize(
"M, N, K, trans_A, trans_B, in_dtype, out_dtype, dtypeAccum, block_M, block_N, block_K, num_stages, num_threads",
[
(128, 128, 128, True, True, T.float8_e5m2fnuz, T.float8_e5m2fnuz, T.float32, 128, 128, 32, 2, 128),
(128, 128, 128, True, True, T.float8_e4m3fnuz, T.float8_e4m3fnuz, T.float32, 128, 128, 32, 2, 128),
],
)
def test_gemm_ss_fp8_rocm(M, N, K, trans_A, trans_B, in_dtype, out_dtype, dtypeAccum, block_M, block_N, block_K, num_stages, num_threads):
run_gemm_ss(M, N, K, trans_A, trans_B, in_dtype, out_dtype, dtypeAccum, block_M, block_N, block_K, num_stages, num_threads)


🛠️ Refactor suggestion | 🟠 Major

Document FP8 test status and unskip plan.

The new FP8 GEMM tests for both CUDA and ROCm are marked with @pytest.mark.skip. The skip reason indicates temporary disabling until GEMM SS/RS/SR/RR issues are resolved. Several ROCm FP8 tests also have TODO comments about precision problems.

Consider:

  1. Creating tracking issues for the known problems referenced in the skip reasons
  2. Adding issue references in the skip decorators for traceability
  3. Documenting the expected timeline for enabling these tests

Example with issue tracking:

@pytest.mark.skip(reason="Temporarily disabling until GEMM SS issues are resolved. See issue #XXXX")

Also applies to: 291-315, 450-473, 615-638

🤖 Prompt for AI Agents
testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py lines 133-157: the
FP8 GEMM tests are skipped with a generic reason; create tracking issues for the
known GEMM SS/RS/SR/RR and ROCm precision problems, then update each
@pytest.mark.skip reason to include the corresponding issue number (e.g.,
"Temporarily disabling until GEMM SS issues are resolved. See issue #<id>") for
traceability, and add a short comment above each skipped block documenting the
expected timeline and next steps to re-enable the tests.

@Gongen-Ali Gongen-Ali changed the title from "[WIP] Rocm ci" to "[Bugfix] Open Rocm ci test and fix some bugs." on Dec 29, 2025
@Gongen-Ali Gongen-Ali force-pushed the rocm_ci branch 4 times, most recently from 0e5da23 to a9f8c40 on December 29, 2025 11:01
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
testing/python/carver/test_tilelang_carver_cuda_driver_properties.py (2)

12-12: Move torch import inside test functions or handle ImportError.

When torch is not installed, the module-level torch import fails before the @tilelang.testing.requires_cuda decorator ever gets a chance to skip the tests. Since the decorator only gates test execution (not module import), an ImportError here prevents the test module from being collected at all.

Consider either:

  1. Moving the torch import inside test functions that need it
  2. Wrapping the import in a try-except block at module level
  3. Using pytest's importorskip if available
🔎 Example fix: Conditional import
-import torch
+try:
+    import torch
+except ImportError:
+    torch = None

Then each test can rely on the @requires_cuda decorator to skip when torch/CUDA is unavailable.


71-76: The function get_max_dynamic_shared_size_bytes() appears to have a semantic mismatch between its name/documentation and its implementation.

The function name and docstring suggest it returns the maximum dynamic shared memory that can be requested per block. However, the implementation uses cudaDevAttrMaxSharedMemoryPerMultiprocessor (attribute 81), which returns the total shared memory capacity of a multiprocessor—not the per-block dynamic allocation limit.

These are different CUDA resource limits:

  • Per-block dynamic memory: Maximum shared memory a single block can request (typically 48KB–96KB depending on compute capability)
  • Per-multiprocessor total: Total shared memory available on an SM, shared among all blocks (96KB–192KB+)

Consider whether the function should instead use cudaDevAttrMaxSharedMemoryPerBlock (attribute 8) to match its semantic intent.
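A small host-side check makes the distinction concrete. This is an illustrative snippet using the CUDA runtime attribute names mentioned above; it is not code from the repository.

#include <cuda_runtime.h>
#include <cstdio>

int main() {
  int per_block = 0, per_sm = 0;
  // Attribute 8: default shared-memory limit for a single block.
  cudaDeviceGetAttribute(&per_block, cudaDevAttrMaxSharedMemoryPerBlock, 0);
  // Attribute 81: total shared memory on one SM, shared by all resident blocks.
  cudaDeviceGetAttribute(&per_sm, cudaDevAttrMaxSharedMemoryPerMultiprocessor, 0);
  std::printf("per-block limit: %d bytes, per-SM capacity: %d bytes\n",
              per_block, per_sm);
  return 0;
}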

testing/python/language/test_tilelang_language_infinity.py (1)

24-24: Remove @requires_cuda or add a TODO explaining the limitation.

This test uses the basic T.infinity() operation with no documented ROCm limitation, unlike other CUDA-restricted tests (unroll, var_init, alloc_var) which include TODO comments. Similar platform-agnostic operations like clamp, clear, and others lack this decorator. Either remove it to allow ROCm testing, or add a TODO comment explaining the platform-specific constraint.

testing/python/language/test_tilelang_language_atomic_add.py (1)

226-236: Add @tilelang.testing.requires_cuda decorators to undecorated atomic tests that use CUDA.

This PR adds the @tilelang.testing.requires_cuda decorator to several tests but misses others that also call .cuda(). The following tests use CUDA and require the decorator:

  • test_atomic_add (line 226)
  • test_atomic_max (line 230)
  • test_atomic_min (line 234)
  • test_atomic_addx4 (line 360)
  • test_atomic_return_prev (line 364)
  • test_tile_atomic_add (line 368)

All underlying run_* functions call .cuda() directly, so tests will fail on non-CUDA systems without decoration.

testing/python/language/test_tilelang_language_composable_index.py (1)

39-39: Critical: Hardcoded CUDA device conflicts with auto target selection.

The PyTorch tensor is created with device="cuda", but the kernel is compiled with target="auto" (line 33). This inconsistency will cause the test to fail on ROCm systems where no CUDA device exists.

🔎 Proposed fix to use dynamic device selection
-    a = torch.randn(M, N, device="cuda", dtype=getattr(torch, dtype))
+    device = "cuda" if torch.cuda.is_available() else "hip"
+    a = torch.randn(M, N, device=device, dtype=getattr(torch, dtype))

Alternatively, if there's a project utility for device detection, use that instead.

♻️ Duplicate comments (7)
testing/python/language/test_tilelang_language_annot.py (2)

31-32: Same TODO clarity issue as above.

The same misleading wording ("disable for now") appears here. Apply the same fix as suggested for lines 7-8.


55-56: Same TODO clarity issue as above.

The same misleading wording ("disable for now") appears here. Apply the same fix as suggested for lines 7-8.

src/tl_templates/hip/reduce.h (1)

262-271: Use tl::shfl_xor instead of raw __shfl_xor for type safety.

The warp_reduce function uses raw HIP intrinsics (__shfl_xor) which will produce incorrect results for half_t and bfloat16_t types. These types require the tl::shfl_xor wrapper (defined in common.h) that converts to float before shuffling.

This is inconsistent with the rest of the file where tl::shfl_* wrappers are correctly used (lines 76, 107, 142, 152, etc.).

🔎 Proposed fix
 template <typename T, typename ReduceOp>
 TL_DEVICE T warp_reduce(T value, ReduceOp op) {
-  value = op(value, __shfl_xor(value, 32));
-  value = op(value, __shfl_xor(value, 16));
-  value = op(value, __shfl_xor(value, 8));
-  value = op(value, __shfl_xor(value, 4));
-  value = op(value, __shfl_xor(value, 2));
-  value = op(value, __shfl_xor(value, 1));
+  value = op(value, tl::shfl_xor(value, 32));
+  value = op(value, tl::shfl_xor(value, 16));
+  value = op(value, tl::shfl_xor(value, 8));
+  value = op(value, tl::shfl_xor(value, 4));
+  value = op(value, tl::shfl_xor(value, 2));
+  value = op(value, tl::shfl_xor(value, 1));
   return value;
 }
src/tl_templates/hip/debug.h (1)

40-49: Missing unsigned long long specialization.

The trait defines long long but not unsigned long long. This could cause the fallback template to be used for uint64_t values, printing them as pointers instead of their actual values.

🔎 Proposed fix
 DEFINE_PRINT_TRAIT(long long, "long long", "%lld", long long);
+DEFINE_PRINT_TRAIT(unsigned long long, "unsigned long long", "%llu", unsigned long long);
testing/python/language/test_tilelang_language_alloc.py (1)

159-160: Same tracking issue suggestion applies here.

As with line 118-119, consider adding a tracking issue reference to ensure ROCm support for multiple variable initialization is tracked and implemented.

testing/python/issue/test_tilelang_issue_1008.py (1)

49-53: LGTM! Same consideration as above.

The @tilelang.testing.requires_cuda decorator correctly gates this test for CUDA availability, matching the hardcoded device="cuda" on line 52. This prevents failures in ROCm-only CI environments.

The same consideration applies: since the kernel logic is device-agnostic and the PR adds ROCm support, consider whether ROCm test coverage should be added for this issue.

testing/python/language/test_tilelang_language_atomic_add.py (1)

253-254: Add missing @requires_cuda decorator (previously flagged).

test_atomic_addx2_float is missing the @tilelang.testing.requires_cuda decorator, while test_atomic_addx2_half (line 248) has it. Both tests invoke run_atomic_addx2, which calls .cuda() and will fail on non-CUDA platforms. This issue was raised in a previous review but remains unaddressed.

🔎 Required fix
+@tilelang.testing.requires_cuda
 def test_atomic_addx2_float():
     run_atomic_addx2(32, 64, 8, 16, dtype=T.float32)
🧹 Nitpick comments (8)
testing/python/language/test_tilelang_language_var_init.py (1)

6-7: CUDA gating aligns with the HIP limitation.

The decorator correctly restricts this test to CUDA since var init is not yet supported on HIP/ROCm.

Optionally, consider making the TODO more actionable by:

  • Referencing a tracking issue number (if one exists)
  • Briefly noting what blocks HIP support (e.g., missing codegen, runtime limitation)

This would help future contributors understand the scope of work needed.

testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py (2)

133-156: Inconsistent skip decorator usage across FP8 tests.

The FP8 tests for SS and RS variants are marked with @pytest.mark.skip, but the SR and RR FP8 tests (lines 448-470, 612-634) are not skipped. This creates inconsistency in which FP8 tests run by default.

Additionally, as noted in past review comments, consider adding issue tracking references to the skip reasons for better traceability.

Verify whether the SR and RR FP8 tests should also be skipped for consistency, or if the SS and RS tests can be unskipped.

Also applies to: 290-313


432-441: TODO comments lack issue tracking references.

Multiple TODO comments reference precision problems on ROCm but don't include tracking issue numbers. Consider creating issues for these known problems and referencing them in the comments for better traceability and project management.

Also applies to: 463-465, 595-597, 627-629

testing/python/debug/test_tilelang_debug_print.py (1)

50-50: Approved: Explicit target="auto" for cross-platform support.

The explicit target="auto" specification aligns with the PR objective to enable ROCm CI testing and ensures these tests run on the appropriate available hardware.

Note: Line 15 in debug_print_buffer doesn't explicitly specify target="auto". While it defaults to "auto" per the documentation, consider adding it for consistency with the rest of the file.

🔎 Optional consistency fix for line 15
-    jit_kernel = tilelang.compile(program)
+    jit_kernel = tilelang.compile(program, target="auto")

Also applies to: 69-69, 88-88, 107-107

testing/python/language/test_tilelang_language_alloc.py (1)

118-119: Consider adding a tracking issue reference to the TODO.

The CUDA-only gating is reasonable given incomplete ROCm support for variable initialization. However, the TODO comment would benefit from a reference to a tracking issue to ensure ROCm support isn't forgotten.

Example format:

# TODO(Gong): ROCm is not supported yet, disable for now (tracked in #XXXX)
testing/python/carver/test_tilelang_carver_cuda_driver_properties.py (1)

29-33: Consider avoiding private torch API.

Line 33 references torch.cuda._CudaDeviceProperties, which is a private class (indicated by the underscore prefix). Private APIs can change without notice in future torch versions.

If type validation is essential, consider checking for the presence of expected attributes instead, or use duck typing:

assert hasattr(prop, 'name') and hasattr(prop, 'multi_processor_count'), \
    "Returned object does not have expected CUDA device properties"
testing/python/language/test_tilelang_language_atomic_add.py (2)

198-198: Consider using consistent dtype defaults across all atomic operation functions.

The atomic_addx2_program and run_atomic_addx2 functions now default to dtype=T.float16, while all other atomic operation functions in this file default to dtype=T.float32 (e.g., atomic_add_program, atomic_max_program, atomic_min_program, etc.). This inconsistency may confuse users expecting uniform behavior across the API.

🔎 Proposed fix to align with float32 default
-def atomic_addx2_program(M, N, block_M, block_N, dtype=T.float16):
+def atomic_addx2_program(M, N, block_M, block_N, dtype=T.float32):
-def run_atomic_addx2(M, N, block_M, block_N, dtype=T.float16):
+def run_atomic_addx2(M, N, block_M, block_N, dtype=T.float32):

If float16 is the intended default for this specific operation, consider documenting the rationale.

Also applies to: 210-210


214-215: Simplify tensor creation by using the target dtype directly.

These lines create tensors in float32 and then convert to the target dtype. Other functions in this file (e.g., run_atomic_add at lines 30-31, run_atomic_max at lines 98-99) create tensors directly with the target dtype using dtype=getattr(torch, dtype), which is more efficient and consistent.

🔎 Proposed simplification
-    A = torch.randn(M, N, dtype=torch.float32).cuda().to(getattr(torch, dtype))
-    B = torch.zeros(M, N, dtype=torch.float32).cuda().to(getattr(torch, dtype))
+    A = torch.randn(M, N, dtype=getattr(torch, dtype)).cuda()
+    B = torch.zeros(M, N, dtype=getattr(torch, dtype)).cuda()
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8e77897 and 066682f.

📒 Files selected for processing (52)
  • .github/workflows/ci.yml
  • src/op/gemm.cc
  • src/op/logical.cc
  • src/target/codegen_hip.cc
  • src/tl_templates/cuda/reduce.h
  • src/tl_templates/hip/atomic.h
  • src/tl_templates/hip/common.h
  • src/tl_templates/hip/debug.h
  • src/tl_templates/hip/hip_fp8.h
  • src/tl_templates/hip/reduce.h
  • testing/python/autotune/test_tilelang_autotune.py
  • testing/python/carver/test_tilelang_carver_cuda_driver_properties.py
  • testing/python/carver/test_tilelang_carver_recommend_hints.py
  • testing/python/components/test_storage_rewrite_detect_inplace.py
  • testing/python/debug/test_device_assert.py
  • testing/python/debug/test_tilelang_debug_print.py
  • testing/python/issue/test_tilelang_issue_1001.py
  • testing/python/issue/test_tilelang_issue_1008.py
  • testing/python/issue/test_tilelang_issue_830.py
  • testing/python/issue/test_tilelang_issue_96.py
  • testing/python/jit/test_tilelang_jit_callback.py
  • testing/python/jit/test_tilelang_jit_cutedsl.py
  • testing/python/jit/test_tilelang_jit_gemm.py
  • testing/python/jit/test_tilelang_jit_gemm_cython.py
  • testing/python/jit/test_tilelang_jit_nvrtc.py
  • testing/python/jit/test_tilelang_jit_parcompile.py
  • testing/python/jit/test_tilelang_jit_tvm_ffi.py
  • testing/python/kernel/test_tilelang_kernel_gemm.py
  • testing/python/kernel/test_tilelang_kernel_gemm_simt.py
  • testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py
  • testing/python/language/test_tilelang_language_alias.py
  • testing/python/language/test_tilelang_language_alloc.py
  • testing/python/language/test_tilelang_language_annot.py
  • testing/python/language/test_tilelang_language_annotate_safe_value.py
  • testing/python/language/test_tilelang_language_atomic_add.py
  • testing/python/language/test_tilelang_language_clear.py
  • testing/python/language/test_tilelang_language_composable_index.py
  • testing/python/language/test_tilelang_language_copy.py
  • testing/python/language/test_tilelang_language_frontend_v2.py
  • testing/python/language/test_tilelang_language_infinity.py
  • testing/python/language/test_tilelang_language_let.py
  • testing/python/language/test_tilelang_language_mask_op.py
  • testing/python/language/test_tilelang_language_ptr.py
  • testing/python/language/test_tilelang_language_unroll.py
  • testing/python/language/test_tilelang_language_var_init.py
  • testing/python/language/test_tilelang_language_vectorized_cast.py
  • testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py
  • testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py
  • tilelang/engine/lower.py
  • tilelang/intrinsics/mfma_layout.py
  • tilelang/intrinsics/mfma_macro_generator.py
  • tilelang/jit/adapter/wrapper.py
💤 Files with no reviewable changes (1)
  • testing/python/issue/test_tilelang_issue_830.py
✅ Files skipped from review due to trivial changes (2)
  • src/tl_templates/hip/hip_fp8.h
  • src/tl_templates/hip/atomic.h
🚧 Files skipped from review as they are similar to previous changes (15)
  • .github/workflows/ci.yml
  • tilelang/intrinsics/mfma_macro_generator.py
  • src/op/gemm.cc
  • testing/python/issue/test_tilelang_issue_1001.py
  • testing/python/language/test_tilelang_language_ptr.py
  • testing/python/language/test_tilelang_language_frontend_v2.py
  • testing/python/jit/test_tilelang_jit_parcompile.py
  • testing/python/jit/test_tilelang_jit_tvm_ffi.py
  • testing/python/language/test_tilelang_language_let.py
  • testing/python/kernel/test_tilelang_kernel_gemm.py
  • testing/python/language/test_tilelang_language_vectorized_cast.py
  • testing/python/language/test_tilelang_language_alias.py
  • tilelang/intrinsics/mfma_layout.py
  • testing/python/issue/test_tilelang_issue_96.py
  • tilelang/jit/adapter/wrapper.py
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.

Applied to files:

  • testing/python/language/test_tilelang_language_clear.py
  • testing/python/language/test_tilelang_language_copy.py
  • testing/python/language/test_tilelang_language_annot.py
  • testing/python/language/test_tilelang_language_annotate_safe_value.py
  • testing/python/kernel/test_tilelang_kernel_gemm_simt.py
  • testing/python/language/test_tilelang_language_mask_op.py
  • testing/python/language/test_tilelang_language_unroll.py
  • testing/python/jit/test_tilelang_jit_cutedsl.py
  • testing/python/language/test_tilelang_language_alloc.py
  • testing/python/jit/test_tilelang_jit_callback.py
  • testing/python/autotune/test_tilelang_autotune.py
  • testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py
  • testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py
  • testing/python/jit/test_tilelang_jit_gemm.py
  • testing/python/jit/test_tilelang_jit_nvrtc.py
  • testing/python/issue/test_tilelang_issue_1008.py
  • testing/python/language/test_tilelang_language_var_init.py
  • tilelang/engine/lower.py
  • testing/python/carver/test_tilelang_carver_cuda_driver_properties.py
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.

Applied to files:

  • testing/python/language/test_tilelang_language_clear.py
  • testing/python/components/test_storage_rewrite_detect_inplace.py
  • testing/python/language/test_tilelang_language_annotate_safe_value.py
  • testing/python/kernel/test_tilelang_kernel_gemm_simt.py
  • testing/python/language/test_tilelang_language_mask_op.py
  • testing/python/language/test_tilelang_language_unroll.py
  • testing/python/language/test_tilelang_language_alloc.py
  • testing/python/language/test_tilelang_language_infinity.py
  • testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py
  • testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py
  • testing/python/issue/test_tilelang_issue_1008.py
  • testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py
📚 Learning: 2025-12-26T06:45:51.789Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1483
File: tilelang/jit/adapter/cutedsl/adapter.py:93-95
Timestamp: 2025-12-26T06:45:51.789Z
Learning: For the CuTeDSL backend in tilelang/jit/adapter/cutedsl/adapter.py, the host_kernel_source and device_kernel_source have the same value.

Applied to files:

  • testing/python/language/test_tilelang_language_copy.py
  • testing/python/jit/test_tilelang_jit_cutedsl.py
📚 Learning: 2025-09-15T10:51:06.985Z
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h.

Applied to files:

  • src/tl_templates/hip/reduce.h
📚 Learning: 2025-09-15T10:51:06.985Z
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). However, the layout template parameter should use sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h where Crosswise uses sizeof(ElementA) but the layout template uses sizeof_bits<ElementA>::value.

Applied to files:

  • src/tl_templates/hip/reduce.h
📚 Learning: 2025-12-24T17:20:32.819Z
Learnt from: clouds56
Repo: tile-ai/tilelang PR: 1527
File: tilelang/env.py:0-0
Timestamp: 2025-12-24T17:20:32.819Z
Learning: The nvidia-cuda-nvcc PyPI package installs to `nvidia/cu13/bin/` (for CUDA 13), `nvidia/cu12/bin/` (for CUDA 12), and `nvidia/cu11/bin/` (for CUDA 11) in the site-packages directory, not to `nvidia/cuda_nvcc/bin/`. These paths should be used when detecting CUDA installations from PyPI packages in tilelang/env.py.

Applied to files:

  • testing/python/issue/test_tilelang_issue_1008.py
  • tilelang/engine/lower.py
🧬 Code graph analysis (14)
testing/python/jit/test_tilelang_jit_gemm_cython.py (2)
tilelang/language/v2/dtypes.py (2)
  • float32 (300-300)
  • float16 (299-299)
tilelang/language/symbolics.py (1)
  • dynamic (12-29)
testing/python/debug/test_device_assert.py (1)
tilelang/language/ast/ir.py (1)
  • target (1677-1707)
testing/python/debug/test_tilelang_debug_print.py (2)
tilelang/jit/__init__.py (2)
  • compile (47-107)
  • compile (347-373)
tilelang/language/ast/ir.py (1)
  • target (1677-1707)
testing/python/language/test_tilelang_language_copy.py (1)
tilelang/language/ast/ir.py (1)
  • target (1677-1707)
src/tl_templates/cuda/reduce.h (1)
src/tl_templates/hip/reduce.h (2)
  • run (121-181)
  • run (188-259)
testing/python/kernel/test_tilelang_kernel_gemm_simt.py (1)
testing/python/kernel/test_tilelang_kernel_gemm.py (1)
  • matmul (6-49)
testing/python/language/test_tilelang_language_unroll.py (1)
tilelang/language/ast/ir.py (1)
  • target (1677-1707)
src/tl_templates/hip/reduce.h (1)
src/tl_templates/hip/common.h (13)
  • tl (110-200)
  • shfl_down (138-140)
  • shfl_down (157-161)
  • shfl_down (182-186)
  • shfl_xor (134-136)
  • shfl_xor (151-155)
  • shfl_xor (176-180)
  • shfl (146-148)
  • shfl (169-173)
  • shfl (194-198)
  • shfl_up (142-144)
  • shfl_up (163-167)
  • shfl_up (188-192)
testing/python/jit/test_tilelang_jit_callback.py (2)
tilelang/engine/callback.py (2)
  • register_cuda_postproc_callback (44-74)
  • register_hip_postproc_callback (77-107)
tilelang/language/v2/dtypes.py (1)
  • float32 (300-300)
testing/python/language/test_tilelang_language_composable_index.py (1)
tilelang/language/ast/ir.py (1)
  • target (1677-1707)
testing/python/jit/test_tilelang_jit_gemm.py (1)
tilelang/language/v2/dtypes.py (1)
  • float32 (300-300)
src/target/codegen_hip.cc (1)
tilelang/language/reduce_op.py (5)
  • warp_reduce_sum (387-400)
  • warp_reduce_max (403-416)
  • warp_reduce_min (419-432)
  • warp_reduce_bitand (435-448)
  • warp_reduce_bitor (451-464)
testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py (2)
testing/python/amd/test_tilelang_test_amd.py (1)
  • run_gemm_rs (195-239)
maint/gemm_v2/correctness_evaluation.py (2)
  • run_gemm_rs (178-210)
  • run_gemm_sr (262-295)
testing/python/carver/test_tilelang_carver_cuda_driver_properties.py (1)
tilelang/carver/arch/driver/cuda_driver.py (7)
  • get_cuda_device_properties (22-30)
  • get_device_name (33-36)
  • get_shared_memory_per_block (39-52)
  • get_device_attribute (55-78)
  • get_persisting_l2_cache_max_size (97-99)
  • get_num_sms (102-118)
  • get_registers_per_block (121-129)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
🔇 Additional comments (57)
testing/python/language/test_tilelang_language_annotate_safe_value.py (1)

45-47: LGTM!

The @tilelang.testing.requires_cuda decorator is appropriate here since the test explicitly targets CUDA (target="cuda" on line 32) and allocates tensors on the CUDA device (device="cuda" on line 34). This ensures the test is skipped gracefully in environments without CUDA support, aligning with the broader test-gating pattern in this PR.

testing/python/jit/test_tilelang_jit_gemm_cython.py (1)

169-169: LGTM! Accumulation dtype upgraded to float32 for improved numerical stability.

The change from T.float16 to T.float32 for the accumulation dtype across all GEMM tests is appropriate. Using higher-precision accumulation while keeping float16 inputs/outputs is a common practice that improves numerical accuracy without significantly impacting memory bandwidth, especially important for ROCm/HIP support.

Also applies to: 211-211, 255-255, 304-304, 306-306, 308-308, 357-357

testing/python/jit/test_tilelang_jit_gemm.py (1)

106-106: LGTM! Test name and accumulation dtype updated consistently.

The function name change from test_gemm_f16f16f16_nn_kernel_jit to test_gemm_f16f16f32_nn_kernel_jit correctly reflects the accumulation dtype upgrade to T.float32. This maintains naming consistency and improves test clarity.

Also applies to: 115-115

testing/python/jit/test_tilelang_jit_cutedsl.py (1)

158-158: LGTM! CUDA gating correctly applied.

The @tilelang.testing.requires_cuda decorators correctly gate these CuTeDSL tests to CUDA-enabled environments, consistent with the broader test gating pattern across the PR.

Also applies to: 210-210, 253-253, 303-303

testing/python/autotune/test_tilelang_autotune.py (1)

263-263: LGTM! CUDA gating correctly applied.

The @tilelang.testing.requires_cuda decorators correctly gate these autotuning tests to CUDA-enabled environments, consistent with the broader test gating pattern across the PR.

Also applies to: 269-269

testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py (1)

156-156: LGTM! CUDA gating correctly applied.

The @tilelang.testing.requires_cuda decorators correctly gate these sparse GEMM v2 tests to CUDA-enabled environments, consistent with the broader test gating pattern across the PR.

Also applies to: 308-308, 459-459, 614-614

testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py (1)

114-118: LGTM! Improved numerical stability with float32 accumulation.

The change from T.float16 to T.float32 for dtypeAccum in multiple test parameterizations improves numerical accuracy for float16 GEMM operations by using higher-precision accumulation.

Also applies to: 271-275, 427-431, 590-593

testing/python/jit/test_tilelang_jit_nvrtc.py (1)

158-158: LGTM! CUDA gating correctly applied.

The @tilelang.testing.requires_cuda decorators correctly gate these NVRTC backend tests to CUDA-enabled environments, consistent with the broader test gating pattern across the PR.

Also applies to: 210-210, 253-253, 303-303, 375-375, 389-389

testing/python/language/test_tilelang_language_clear.py (1)

44-44: LGTM! Target change enables automatic backend selection.

The change from target="cuda" to target="auto" aligns with the PR objectives to support ROCm and automatic backend detection.

testing/python/language/test_tilelang_language_copy.py (4)

32-32: LGTM! Target change enables automatic backend selection.

The change from target="cuda" to target="auto" aligns with the PR objectives to support ROCm and automatic backend detection.


69-69: LGTM! Consistent target change.


134-134: LGTM! Consistent target change.


146-151: Verify target parameter consistency for FP8 tests.

The FP8 tests at lines 148 and 167 (in run_tilelang_copy_fp8_e8m0 and run_tilelang_copy_fp4) don't specify a target parameter in tilelang.compile(), unlike the other test functions in this file that now use target="auto". Confirm whether this omission is intentional or if these should also specify target="auto" for consistency.

testing/python/debug/test_device_assert.py (1)

16-16: LGTM! Target changes enable automatic backend selection.

The changes from target="cuda" to target="auto" at lines 16 and 28 align with the PR objectives.

testing/python/language/test_tilelang_language_mask_op.py (1)

32-32: LGTM! Target changes enable automatic backend selection.

All four occurrences (lines 32, 68, 105, 141) consistently change from target="cuda" to target="auto", aligning with the PR objectives to support ROCm and automatic backend detection.

testing/python/language/test_tilelang_language_unroll.py (2)

16-16: LGTM! Target change enables automatic backend selection.


20-33: New CUDA-specific unroll test added.

The new test correctly verifies unroll pragma generation with an unroll factor. The combination of @tilelang.testing.requires_cuda decorator and target="auto" is appropriate here: the decorator ensures the test runs only when CUDA is available (since unroll_factor is not supported on HIP per the TODO), while target="auto" allows the compilation to auto-detect the backend within that CUDA environment.

testing/python/kernel/test_tilelang_kernel_gemm_simt.py (3)

157-157: Correct bug fix: reference should use out_dtype.

The kernel returns a tensor of type out_dtype (see line 81), not accum_dtype. This change correctly aligns the reference computation with the actual kernel output type.


168-170: Good addition: int8 test with proper CUDA gating.

The new test adds coverage for the int8 data path with int32 accumulation, and the @tilelang.testing.requires_cuda decorator appropriately gates the test for CUDA availability.


164-164: No action needed. The test at line 164 was written with T.float32 accumulation as part of the original file creation, not as a change from T.float16. Additionally, test coverage for float16 accumulation already exists in testing/python/kernel/test_tilelang_kernel_gemm_mma_intrinsic.py (line 222), which uses assert_tl_matmul_correctness(128, 128, 128, T.float16, T.float16, T.float16).

Likely an incorrect or invalid review comment.

testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py (2)

199-199: Appropriate CUDA gating for MMA-based test.

The @tilelang.testing.requires_cuda decorator correctly gates this test, which relies on CUDA tensors and tensor core MMA operations.


403-403: Appropriate CUDA gating added to existing decorators.

The additional @tilelang.testing.requires_cuda decorator properly gates this test for CUDA availability, complementing the existing package and LLVM requirements.

testing/python/components/test_storage_rewrite_detect_inplace.py (2)

4-6: No issues found. The check_hip_availability function exists in tilelang/utils.target, returns a boolean as expected, and is imported and used correctly in the test file. The module-level initialization of _IS_HIP_AVAILABLE is appropriate for gating platform-specific test assertions.


60-62: The test already verifies patterns against actual generated kernel source.

The assertions check the pattern count directly in the kernel source returned by kernel.get_kernel_source(), so the HIP and CUDA patterns are automatically verified against actual codegen output at test runtime. Pattern correctness cannot be confirmed through static codebase analysis; it will be validated when the test executes.

testing/python/carver/test_tilelang_carver_recommend_hints.py (1)

136-136: Clarify the intent behind selective CUDA gating for FMHA test.

The @tilelang.testing.requires_cuda decorator is applied only to test_fmha_recommend_hints, while the other four test functions (test_general_reduction_recommend_hints, test_elementwise_recommend_hints, test_matmul_recommend_hints, test_gemv_recommend_hints) do not have it. All tests use auto_infer_current_arch() and carver templates.

Notably, the FlashAttentionTemplate implementation contains no obvious CUDA-specific code—it uses generic TVM operations like te.placeholder() and te.compute(). The requires_cuda decorator is imported from TVM's testing utilities and checks for CUDA availability at runtime.

Given the PR context ("Open Rocm ci test and fix some bugs"), this selective gating may be intentional (e.g., FMHA not yet validated on ROCm), but this should be documented if it's a deliberate restriction rather than an oversight.

testing/python/debug/test_tilelang_debug_print.py (2)

2-2: LGTM: pytest import for parametrization.

The import is necessary for the pytest.mark.parametrize decorator added below and follows standard testing practices.


20-24: Good refactor: Parametrized dtype testing.

The parametrization consolidates testing across 12 dtypes and appropriately separates FP8 types into hardware-specific tests (CUDA-only and ROCm-only variants). Bfloat16 and other parameterized dtypes are widely supported across both CUDA and ROCm platforms.

src/op/logical.cc (2)

53-54: Verify HIP backend support for the external call.

Similar to tl.any_of, the HIP intrinsic registration for tl.all_of reuses the lowering function that generates a call_extern to "tl::All". Ensure that the HIP backend/runtime provides an implementation for this external symbol.

The verification script provided in the previous comment will also check for "tl::All" implementations.


45-46: HIP backend support verified for external calls.

The HIP intrinsic registrations at lines 45-46 and 53-54 are correct. Both tl::Any and tl::All implementations exist in src/tl_templates/hip/common.h with identical logic to their CUDA counterparts in src/tl_templates/cuda/common.h, confirming that the call_extern references will resolve properly at compile time.

src/target/codegen_hip.cc (2)

945-947: LGTM! FP8 e5m2 dtype mappings added for MFMA.

The mappings for float8_e5m2fnuzx4 and float8_e5m2fnuzx8 are consistent with the existing e4m3 variants, correctly mapping to fp8_e5_4_t and long respectively.


985-994: LGTM! Warp reduction intrinsics for HIP.

The codegen correctly emits calls to tl::warp_reduce_* functions that correspond to the Python intrinsics defined in tilelang/language/reduce_op.py. The pattern is consistent with other intrinsic handling in this file.
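To illustrate how such emitted calls can be satisfied on the template side, the tl::warp_reduce(value, op) primitive reviewed above composes with a small functor. The warp_sum and SumOp names below are hypothetical; the actual tl::warp_reduce_* definitions in reduce.h may be written differently.

// Hypothetical composition: a named reduction is warp_reduce plus an operator.
struct SumOp {
  template <typename T>
  __device__ T operator()(T a, T b) const { return a + b; }
};

template <typename T>
__device__ T warp_sum(T value) {
  // Relies on the tl::warp_reduce(value, op) shown earlier in this review.
  return tl::warp_reduce(value, SumOp{});
}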

tilelang/engine/lower.py (3)

13-13: LGTM! Centralized path constants.

Importing path constants from tilelang.env centralizes configuration and removes duplicated path resolution logic.


71-75: LGTM! Clean path usage for CUDA compilation.

Using the centralized TILELANG_TEMPLATE_PATH and CUTLASS_INCLUDE_DIR constants simplifies the include path configuration.


118-122: LGTM! Clean path usage for HIP compilation.

Consistently uses TILELANG_TEMPLATE_PATH and COMPOSABLE_KERNEL_INCLUDE_DIR for HIP compilation, mirroring the CUDA path pattern.

src/tl_templates/cuda/reduce.h (2)

177-179: LGTM! CumSum2D signature and SEG parameterization.

The return type change to void is appropriate since the function writes results to dst rather than returning a value. The configurable SEG template parameter (defaulting to 32 for CUDA warp size) provides flexibility while maintaining sensible defaults.


185-191: LGTM! Improved thread/lane calculation and boundary check.

The lane = tid % SEG and row = tid / SEG pattern allows proper mapping to 2D positions with configurable segment size. The early return on gRow >= H correctly constrains processing to valid rows.
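A quick host-side walkthrough of that mapping, with illustrative values rather than test code:

#include <cstdio>

int main() {
  constexpr int SEG = 32;   // CUDA default segment width
  for (int tid = 0; tid < 64; tid += 16) {
    int lane = tid % SEG;   // position within the segment
    int row = tid / SEG;    // which row of the 2D tile this thread scans
    std::printf("tid %2d -> row %d, lane %2d\n", tid, row, lane);
  }
  return 0;
}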

src/tl_templates/hip/reduce.h (4)

76-77: LGTM! Type-safe shuffle wrapper usage.

Correctly uses tl::shfl_down instead of the raw HIP intrinsic, ensuring proper handling of half_t and bfloat16_t types through the specializations in common.h.


107-108: LGTM! Type-safe shuffle wrapper in AllReduce.

Uses tl::shfl_xor for consistent type-safe shuffling across all data types.


117-182: LGTM! CumSum1D implementation for HIP.

The 1D cumulative sum correctly uses tl::shfl_down, tl::shfl_up, and tl::shfl wrappers throughout. The SEG=64 default is appropriate for AMD wavefront size.


184-260: LGTM! CumSum2D implementation for HIP.

The 2D cumulative sum mirrors the CUDA version with appropriate use of tl::shfl_* wrappers and SEG=64 default for AMD wavefront size. The Axis-driven indexing logic is consistent with the CUDA implementation.

src/tl_templates/hip/common.h (4)

3-5: LGTM! New includes for modular headers.

Adding atomic.h (for relocated atomic operations) and amd_warp_functions.h (for warp intrinsics) properly modularizes the codebase.


110-130: LGTM! Any/All utility functions.

Simple and correct implementations of warp-level any/all predicates.


132-148: LGTM! Generic shuffle wrapper templates.

The generic templates provide a clean abstraction over HIP's raw __shfl_* intrinsics.


150-198: LGTM! Type-safe shuffle specializations for half/bfloat16.

The specializations for half_t and bfloat16_t correctly convert to float before shuffling and back after, ensuring proper warp communication for these 16-bit types that HIP intrinsics don't natively support.

src/tl_templates/hip/debug.h (5)

6-21: LGTM! Clean trait-based debug printing architecture.

The PrintTraits<T> primary template provides a sensible fallback that prints the address for unknown types, while the macro-generated specializations handle known types with proper format specifiers.


23-38: LGTM! Macro for generating print traits.

The DEFINE_PRINT_TRAIT macro eliminates boilerplate for each type specialization while maintaining consistent formatting.
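A condensed sketch of the trait-plus-macro shape the two comments above describe. Details are simplified relative to the actual debug.h (for example, the real traits also carry a type-name string and message prefix), so treat it as a pattern illustration rather than the shipped code.

#include <cstdio>

// Fallback: unknown types are reported by address only.
template <typename T> struct PrintTraits {
  __device__ static void print(const T &v) {
    printf("<unsupported type at %p>\n", static_cast<const void *>(&v));
  }
};

// One macro stamps out a specialization per known scalar type.
#define DEFINE_PRINT_TRAIT_SKETCH(Type, Fmt, CastType)                         \
  template <> struct PrintTraits<Type> {                                       \
    __device__ static void print(const Type &v) {                              \
      printf(Fmt "\n", static_cast<CastType>(v));                              \
    }                                                                          \
  };

DEFINE_PRINT_TRAIT_SKETCH(int, "%d", int)
DEFINE_PRINT_TRAIT_SKETCH(float, "%f", double)
DEFINE_PRINT_TRAIT_SKETCH(unsigned long long, "%llu", unsigned long long)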


56-57: LGTM! FP8 type support in debug printing.

Adding fp8_e4_t and fp8_e5_t traits ensures these new types can be properly debug-printed with float conversion.


76-90: LGTM! Generic pointer specialization.

The partial specialization PrintTraits<T*> handles all pointer types uniformly, correctly using %p format specifier.


92-100: LGTM! Simplified debug print functions.

The public debug_print_var and debug_print_buffer_value functions now simply delegate to PrintTraits<T>, keeping the interface clean while the trait system handles type dispatch.

testing/python/language/test_tilelang_language_alloc.py (1)

36-36: The relaxed assertions at lines 36 and 76 are intentional and correctly designed. These tests validate a basic feature that works on both CUDA and ROCm backends, and both code generation patterns ("tmp =" and "tmp[0] =") are valid outputs. The tests that lack ROCm support—test_alloc_var_with_initializer and test_alloc_multi_vars_with_initializer—are already properly gated with @tilelang.testing.requires_cuda decorators and TODO comments. No changes needed.

Likely an incorrect or invalid review comment.

testing/python/issue/test_tilelang_issue_1008.py (1)

42-46: LGTM. The @tilelang.testing.requires_cuda decorator is correctly applied to gate these tests for CUDA availability, matching the hardcoded device="cuda" on lines 45 and 52, which prevents failures in ROCm-only or non-CUDA CI environments.

testing/python/jit/test_tilelang_jit_callback.py (4)

4-4: LGTM: HIP callback import added for multi-platform support.

The import correctly adds register_hip_postproc_callback to support HIP/ROCm platforms alongside CUDA.


93-96: LGTM: HIP callback implementation correctly mirrors CUDA callback.

The HIP post-processing callback follows the same pattern as the CUDA callback (lines 88-91), enabling multi-platform support. Both callbacks being registered within the same function scope is correct—only the relevant callback will be invoked based on the compilation target.


232-232: LGTM: Consistent accumulation dtype change.

This change matches the adjustment on line 117, ensuring consistent use of T.float32 for accumulation across both test functions.


117-117: Accumulation dtype change is consistent with codebase patterns.

The change to T.float32 for accumulation aligns with all similar tests across the codebase (test_tilelang_jit_tvm_ffi.py, test_tilelang_jit_nvrtc.py, test_tilelang_jit_nullptr.py, and test_tilelang_jit_parcompile.py all use accum_dtype=T.float32). This follows numerical best practices for matrix operations.

testing/python/language/test_tilelang_language_infinity.py (2)

3-3: LGTM!

The import is properly used for the test decorator and test runner.


33-33: LGTM!

Standard test runner invocation is correct.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
testing/python/cache/test_tilelang_kernel_cache.py (1)

250-250: Same verification needed as test_cache_miss_detection.

This test also only compiles without GPU execution. Verify whether CUDA-specific gating is necessary for cache isolation testing, especially in the context of enabling ROCm CI support.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 066682f and 0e7749b.

📒 Files selected for processing (1)
  • testing/python/cache/test_tilelang_kernel_cache.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.

Applied to files:

  • testing/python/cache/test_tilelang_kernel_cache.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
🔇 Additional comments (1)
testing/python/cache/test_tilelang_kernel_cache.py (1)

199-199: The @requires_cuda decorator on this test is appropriate. All backends tested (tvm_ffi, cython, nvrtc, cutedsl) are CUDA-based, as stated in the test docstring. There are no ROCm/HIP variants or backends in this test file, unlike other test files that define separate @requires_rocm tests for alternative backends. The decorator correctly reflects the actual dependencies.

Likely an incorrect or invalid review comment.

return cache_dir


@tilelang.testing.requires_cuda

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the test file structure around line 121
head -135 testing/python/cache/test_tilelang_kernel_cache.py | tail -30

Repository: tile-ai/tilelang

Length of output: 1029


🏁 Script executed:

# Check the test functions and decorator additions mentioned
rg -n "@tilelang.testing.requires_cuda" testing/python/cache/test_tilelang_kernel_cache.py

Repository: tile-ai/tilelang

Length of output: 167


🏁 Script executed:

# Look for backend definitions and usage
rg -n "BACKENDS" testing/python/cache/test_tilelang_kernel_cache.py -A 10

Repository: tile-ai/tilelang

Length of output: 1831


🏁 Script executed:

# Check for ROCm/HIP support in the codebase
rg -n "rocm|hip|amdgpu" --type=py | head -30

Repository: tile-ai/tilelang

Length of output: 2890


🏁 Script executed:

# Check the requires_cuda decorator implementation
find . -name "*.py" -type f | xargs grep -l "requires_cuda" | head -10

Repository: tile-ai/tilelang

Length of output: 629


🏁 Script executed:

# Look for decorator implementation
rg -n "def requires_cuda|requires_cuda.*=" --type=py

Repository: tile-ai/tilelang

Length of output: 1259


🏁 Script executed:

# Look at test_disk_cache_with_postproc implementation
sed -n '121,200p' testing/python/cache/test_tilelang_kernel_cache.py

Repository: tile-ai/tilelang

Length of output: 2835


🏁 Script executed:

# Look at test_cache_miss_detection implementation
sed -n '199,250p' testing/python/cache/test_tilelang_kernel_cache.py

Repository: tile-ai/tilelang

Length of output: 1816


🏁 Script executed:

# Look at test_cache_isolation_between_tests implementation
sed -n '250,300p' testing/python/cache/test_tilelang_kernel_cache.py

Repository: tile-ai/tilelang

Length of output: 1474


🏁 Script executed:

# Check if there are .cuda() calls in the test file
rg -n "\.cuda\(\)" testing/python/cache/test_tilelang_kernel_cache.py

Repository: tile-ai/tilelang

Length of output: 175


🏁 Script executed:

# Check for any GPU execution or device calls
rg -n "tvm\.cuda|device|gpu|\.to_device" testing/python/cache/test_tilelang_kernel_cache.py

Repository: tile-ai/tilelang

Length of output: 144


🏁 Script executed:

# Check if tilelang.compile requires CUDA or can work without it
rg -n "def compile" tilelang --type=py -A 5 | head -40

Repository: tile-ai/tilelang

Length of output: 2784


🏁 Script executed:

# Check if the backends can compile without GPU
rg -n "execution_backend" tilelang --type=py | head -20

Repository: tile-ai/tilelang

Length of output: 1717


🏁 Script executed:

# Check if there are other tests that compile without requires_cuda
rg -B 3 "tilelang.compile" testing --type=py | grep -A 3 "def test_" | head -30

Repository: tile-ai/tilelang

Length of output: 42


Two tests should not have the @tilelang.testing.requires_cuda decorator.

test_disk_cache_with_postproc correctly has the decorator since it executes kernels on GPU (lines 187-188). However, test_cache_miss_detection and test_cache_isolation_between_tests only call tilelang.compile() without any GPU execution and should not be gated to CUDA-only.

Given the PR objective to "Open Rocm ci test", decorating compilation-only tests with requires_cuda will skip them entirely on ROCm CI. Remove the @tilelang.testing.requires_cuda decorator from test_cache_miss_detection (line 199) and test_cache_isolation_between_tests (line 250) so these tests run on all platforms.

🤖 Prompt for AI Agents
testing/python/cache/test_tilelang_kernel_cache.py lines 199 and 250: the tests
`test_cache_miss_detection` (line 199) and `test_cache_isolation_between_tests`
(line 250) are compilation-only and should not be CUDA-gated; remove the
`@tilelang.testing.requires_cuda` decorator from both test definitions so they
run on all platforms (leave the decorator on `test_disk_cache_with_postproc`
which runs GPU kernels).

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
testing/python/language/test_tilelang_language_vectorize.py (2)

51-57: Other CUDA-dependent tests lack the requires_cuda decorator.

For consistency with the gating applied to test_vectorize_all_dtypes, consider also decorating test_vectorize and test_vectorize_invariant_index with @tilelang.testing.requires_cuda, as they also create CUDA tensors (via run_vectorize and run_vectorize_invariant_index).

🔎 Suggested change
+@tilelang.testing.requires_cuda
 def test_vectorize():
     N, M = 512, 256
+@tilelang.testing.requires_cuda
 def test_vectorize_invariant_index():
     N, M = 512, 256

127-148: Consider adding an assertion to verify kernel output.

The decorator addition for CUDA gating is appropriate. However, the test only verifies that the kernel executes without error—it doesn't assert that the output tensor x contains the expected values (i.e., [1, 2, 3, ..., vec_num, 0, 0, ...] cast to the given dtype).

Additionally, torch.float8_e8m0fnu (line 141) was introduced in PyTorch 2.7+ and may not be available in all environments.

🔎 Suggested assertion example
 def test_vectorize_all_dtypes(dtype, vec_num):
     x = torch.empty((64,), dtype=dtype, device="cuda")
     kernel = vectorize_test_all_dtypes(dtype, vec_num)
     kernel(x)
+    # Verify first vec_num elements contain expected values
+    expected = torch.arange(1, vec_num + 1, dtype=dtype, device="cuda")
+    torch.testing.assert_close(x[:vec_num], expected)
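On the torch.float8_e8m0fnu availability note above, one hedged way to keep the parametrization while skipping gracefully on older PyTorch builds (the guard itself is a sketch, not code from this PR):

import pytest
import torch

# Skip the e8m0 case on PyTorch builds that predate the dtype (added in 2.7).
e8m0_param = pytest.param(
    getattr(torch, "float8_e8m0fnu", None),
    marks=pytest.mark.skipif(
        not hasattr(torch, "float8_e8m0fnu"),
        reason="torch.float8_e8m0fnu requires PyTorch 2.7+"),
)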
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0e7749b and 3e0a865.

📒 Files selected for processing (1)
  • testing/python/language/test_tilelang_language_vectorize.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.

Applied to files:

  • testing/python/language/test_tilelang_language_vectorize.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)

Member

@LeiWang1999 LeiWang1999 left a comment

Overall LGTM, left some comments.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
testing/python/language/test_tilelang_language_var_init.py (1)

17-17: Fix comment typo.

The comment references variable c, but the actual variable name is d.

🔎 Proposed fix
-                d = T.alloc_var(T.int32, init=a)  # c gets new value of a
+                d = T.alloc_var(T.int32, init=a)  # d gets new value of a
♻️ Duplicate comments (6)
testing/python/cache/test_tilelang_kernel_cache.py (2)

199-199: Remove CUDA-gating from compilation-only test.

test_cache_miss_detection only calls tilelang.compile() without any GPU execution. Gating it with @tilelang.testing.requires_cuda will skip it entirely on ROCm CI, contradicting the PR objective to "Open Rocm ci test." Remove this decorator so the test runs on all platforms including ROCm.

🔎 Proposed fix
-@tilelang.testing.requires_cuda
 @pytest.mark.parametrize("backend", BACKENDS)
 def test_cache_miss_detection(clean_cache_env, backend):

250-250: Remove CUDA-gating from compilation-only test.

test_cache_isolation_between_tests only calls tilelang.compile() without any GPU execution. Gating it with @tilelang.testing.requires_cuda will skip it entirely on ROCm CI, contradicting the PR objective to "Open Rocm ci test." Remove this decorator so the test runs on all platforms including ROCm.

🔎 Proposed fix
-@tilelang.testing.requires_cuda
 @pytest.mark.parametrize("backend", BACKENDS)
 def test_cache_isolation_between_tests(clean_cache_env, backend):
testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py (2)

133-156: Add tracking issue references to skip decorators.

The FP8 GEMM SS tests are skipped with a generic reason. Per previous review feedback, create tracking issues for the known GEMM SS problems and reference them in the skip decorators for traceability.

Example:

@pytest.mark.skip(reason="Temporarily disabling until GEMM SS issues are resolved. See issue #XXXX")

290-313: Add tracking issue references to skip decorators.

Similar to the SS variant, these FP8 GEMM RS tests need tracking issue references in their skip decorators for better traceability.

testing/python/language/test_tilelang_language_atomic_add.py (1)

248-254: Add missing @tilelang.testing.requires_cuda decorator for consistency.

test_atomic_addx2_half (Line 248) is decorated with @tilelang.testing.requires_cuda, but test_atomic_addx2_float (Line 253) is not. Both tests use the same underlying atomic_addx2 operation, which is CUDA-specific. This inconsistency may cause test_atomic_addx2_float to fail on non-CUDA platforms.

🔎 Proposed fix
+@tilelang.testing.requires_cuda
 def test_atomic_addx2_float():
     run_atomic_addx2(32, 64, 8, 16, dtype=T.float32)
src/tl_templates/hip/reduce.h (1)

262-271: Use tl::shfl_xor instead of raw __shfl_xor for type safety.

The warp_reduce function uses raw HIP intrinsics (__shfl_xor) which will produce incorrect results for half_t and bfloat16_t types. These types require the tl::shfl_xor wrapper (defined in common.h lines 134-180) that converts to float before shuffling.

This is inconsistent with the rest of the file where tl::shfl_* wrappers are correctly used (e.g., lines 76, 107, 142, 165, etc.).

🔎 Proposed fix
 template <typename T, typename ReduceOp>
 TL_DEVICE T warp_reduce(T value, ReduceOp op) {
-  value = op(value, __shfl_xor(value, 32));
-  value = op(value, __shfl_xor(value, 16));
-  value = op(value, __shfl_xor(value, 8));
-  value = op(value, __shfl_xor(value, 4));
-  value = op(value, __shfl_xor(value, 2));
-  value = op(value, __shfl_xor(value, 1));
+  value = op(value, tl::shfl_xor(value, 32));
+  value = op(value, tl::shfl_xor(value, 16));
+  value = op(value, tl::shfl_xor(value, 8));
+  value = op(value, tl::shfl_xor(value, 4));
+  value = op(value, tl::shfl_xor(value, 2));
+  value = op(value, tl::shfl_xor(value, 1));
   return value;
 }
🧹 Nitpick comments (8)
testing/python/jit/test_tilelang_jit_tvm_ffi.py (1)

373-378: Consider adding @tilelang.testing.requires_cuda decorator for consistency.

The test already performs a runtime Hopper GPU check, but adding the @tilelang.testing.requires_cuda decorator would be more consistent with test_tvm_ffi_l2_persistent_map, since TMA (Tensor Memory Accelerator) is a CUDA-specific hardware feature.

🔎 Proposed addition
+@tilelang.testing.requires_cuda
 def test_tvm_ffi_im2col_tma_desc():
     """Test im2col TMA descriptor with tvm_ffi backend."""
     if not check_hopper():
testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py (4)

432-441: Convert commented-out tests to explicit skip decorators.

The commented-out test cases with TODO notes about ROCm precision problems reduce visibility and make systematic re-enabling harder. Consider:

  1. Creating tracking issues for the precision problems
  2. Uncommenting the test cases and marking them with @pytest.mark.skip decorators that reference the issues
  3. Adding platform-specific skip conditions if needed

This approach improves traceability and makes it easier to re-enable tests once issues are resolved.

Example refactor
@tilelang.testing.requires_rocm
@pytest.mark.skip(reason="ROCm precision problem for int8 SR variant. See issue #XXXX")
@pytest.mark.parametrize(...)
def test_gemm_sr_int8_rocm_precision(...):
    ...

459-470: Use explicit skip decorator for precision-problematic test case.

Line 464 has a commented-out float8_e5m2fnuz test with a TODO about precision problems. Similar to the earlier recommendation, use an explicit skip decorator with a tracking issue reference instead of commenting out the test.


595-597: Convert commented-out tests to explicit skip decorators.

Similar to Lines 432-441, these commented-out test cases should be converted to explicit skip decorators with tracking issue references for better visibility and traceability.


623-634: Use explicit skip decorator for precision-problematic test case.

Line 628 has a commented-out float8_e5m2fnuz test case. Use an explicit skip decorator with tracking issue reference instead.

testing/python/language/test_tilelang_language_atomic_add.py (1)

198-200: Consider aligning default dtype with other atomic functions.

While the dtype parameterization is correct, the default value T.float16 differs from other atomic operation functions in this file (e.g., atomic_add_program, atomic_max_program), which default to T.float32. This inconsistency might cause confusion, though it could be intentional given that atomic_addx2 is typically used with FP16 pairs.

🔎 Proposed alignment
-def atomic_addx2_program(M, N, block_M, block_N, dtype=T.float16):
+def atomic_addx2_program(M, N, block_M, block_N, dtype=T.float32):

And similarly for run_atomic_addx2:

-def run_atomic_addx2(M, N, block_M, block_N, dtype=T.float16):
+def run_atomic_addx2(M, N, block_M, block_N, dtype=T.float32):
testing/python/language/test_tilelang_language_ptr.py (1)

49-52: Consider adding CUDA availability check.

The test allocates tensors on CUDA device without checking if CUDA is available. If this test is intended to run only on CUDA-capable systems, consider adding the @tilelang.testing.requires_cuda decorator to test_matmul() (line 61), consistent with the gating pattern mentioned in the PR summary.

🔎 Proposed fix to add CUDA gating

Add the decorator to the test function:

+@tilelang.testing.requires_cuda
 def test_matmul():
     run_matmul(1024, 1024, 1024, 128, 128, 32)
src/tl_templates/hip/atomic.h (1)

1-104: Acknowledge the memory_order parameter limitation.

All atomic functions accept a memory_order parameter but ignore it, as documented in the comments. While a past review flagged this inconsistency with CUDA (which conditionally applies memory ordering), it was marked as addressed in commits 14067c3 to d592f8b.

The current approach—keeping the parameter for lowering compatibility while ignoring it—is pragmatic if HIP lacks equivalent memory ordering semantics. However, consider documenting at the file level why HIP cannot support explicit memory ordering (e.g., lack of equivalent HIP atomic variants or fence semantics) to prevent future confusion.

📝 Suggested file-level documentation

Add a comment at the top of the file explaining the memory_order limitation:

 #pragma once
 
 #include <hip/hip_runtime.h>
+
+// Note: All atomic functions accept a memory_order parameter to match the
+// lowering interface, but HIP atomics do not support explicit memory ordering
+// semantics equivalent to CUDA's acquire/release/seq_cst variants. The parameter
+// is ignored and all operations use HIP's default atomic behavior. Use explicit
+// fences (__threadfence_block, __threadfence, __threadfence_system) if stronger
+// ordering guarantees are required.
 
 // Add an extra unused input to accommodate the additional 'memory_order'
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3e0a865 and 92a330d.

📒 Files selected for processing (54)
  • .github/workflows/ci.yml
  • src/op/gemm.cc
  • src/op/logical.cc
  • src/target/codegen_hip.cc
  • src/tl_templates/cuda/reduce.h
  • src/tl_templates/hip/atomic.h
  • src/tl_templates/hip/common.h
  • src/tl_templates/hip/debug.h
  • src/tl_templates/hip/hip_fp8.h
  • src/tl_templates/hip/reduce.h
  • testing/python/autotune/test_tilelang_autotune.py
  • testing/python/cache/test_tilelang_kernel_cache.py
  • testing/python/carver/test_tilelang_carver_cuda_driver_properties.py
  • testing/python/carver/test_tilelang_carver_recommend_hints.py
  • testing/python/components/test_storage_rewrite_detect_inplace.py
  • testing/python/debug/test_device_assert.py
  • testing/python/debug/test_tilelang_debug_print.py
  • testing/python/issue/test_tilelang_issue_1001.py
  • testing/python/issue/test_tilelang_issue_1008.py
  • testing/python/issue/test_tilelang_issue_830.py
  • testing/python/issue/test_tilelang_issue_96.py
  • testing/python/jit/test_tilelang_jit_callback.py
  • testing/python/jit/test_tilelang_jit_cutedsl.py
  • testing/python/jit/test_tilelang_jit_gemm.py
  • testing/python/jit/test_tilelang_jit_gemm_cython.py
  • testing/python/jit/test_tilelang_jit_nvrtc.py
  • testing/python/jit/test_tilelang_jit_parcompile.py
  • testing/python/jit/test_tilelang_jit_tvm_ffi.py
  • testing/python/kernel/test_tilelang_kernel_gemm.py
  • testing/python/kernel/test_tilelang_kernel_gemm_simt.py
  • testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py
  • testing/python/language/test_tilelang_language_alias.py
  • testing/python/language/test_tilelang_language_alloc.py
  • testing/python/language/test_tilelang_language_annot.py
  • testing/python/language/test_tilelang_language_annotate_safe_value.py
  • testing/python/language/test_tilelang_language_atomic_add.py
  • testing/python/language/test_tilelang_language_clear.py
  • testing/python/language/test_tilelang_language_composable_index.py
  • testing/python/language/test_tilelang_language_copy.py
  • testing/python/language/test_tilelang_language_frontend_v2.py
  • testing/python/language/test_tilelang_language_infinity.py
  • testing/python/language/test_tilelang_language_let.py
  • testing/python/language/test_tilelang_language_mask_op.py
  • testing/python/language/test_tilelang_language_ptr.py
  • testing/python/language/test_tilelang_language_unroll.py
  • testing/python/language/test_tilelang_language_var_init.py
  • testing/python/language/test_tilelang_language_vectorize.py
  • testing/python/language/test_tilelang_language_vectorized_cast.py
  • testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py
  • testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py
  • tilelang/engine/lower.py
  • tilelang/intrinsics/mfma_layout.py
  • tilelang/intrinsics/mfma_macro_generator.py
  • tilelang/jit/adapter/wrapper.py
💤 Files with no reviewable changes (3)
  • testing/python/language/test_tilelang_language_composable_index.py
  • testing/python/language/test_tilelang_language_copy.py
  • testing/python/issue/test_tilelang_issue_830.py
🚧 Files skipped from review as they are similar to previous changes (21)
  • src/op/gemm.cc
  • testing/python/language/test_tilelang_language_clear.py
  • testing/python/language/test_tilelang_language_mask_op.py
  • testing/python/language/test_tilelang_language_alias.py
  • testing/python/jit/test_tilelang_jit_gemm.py
  • tilelang/engine/lower.py
  • testing/python/carver/test_tilelang_carver_cuda_driver_properties.py
  • testing/python/components/test_storage_rewrite_detect_inplace.py
  • testing/python/debug/test_device_assert.py
  • testing/python/issue/test_tilelang_issue_1001.py
  • testing/python/autotune/test_tilelang_autotune.py
  • testing/python/kernel/test_tilelang_kernel_gemm_simt.py
  • src/tl_templates/hip/debug.h
  • testing/python/language/test_tilelang_language_annotate_safe_value.py
  • testing/python/language/test_tilelang_language_vectorized_cast.py
  • .github/workflows/ci.yml
  • testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py
  • testing/python/kernel/test_tilelang_kernel_gemm.py
  • testing/python/language/test_tilelang_language_annot.py
  • testing/python/issue/test_tilelang_issue_1008.py
  • testing/python/language/test_tilelang_language_unroll.py
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.

Applied to files:

  • testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py
  • testing/python/carver/test_tilelang_carver_recommend_hints.py
  • testing/python/jit/test_tilelang_jit_callback.py
  • testing/python/jit/test_tilelang_jit_parcompile.py
  • testing/python/language/test_tilelang_language_vectorize.py
  • testing/python/jit/test_tilelang_jit_tvm_ffi.py
  • testing/python/language/test_tilelang_language_var_init.py
  • testing/python/jit/test_tilelang_jit_cutedsl.py
  • testing/python/language/test_tilelang_language_frontend_v2.py
  • testing/python/cache/test_tilelang_kernel_cache.py
  • testing/python/language/test_tilelang_language_ptr.py
  • testing/python/jit/test_tilelang_jit_gemm_cython.py
  • testing/python/issue/test_tilelang_issue_96.py
  • testing/python/language/test_tilelang_language_let.py
  • testing/python/language/test_tilelang_language_alloc.py
  • testing/python/jit/test_tilelang_jit_nvrtc.py
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.

Applied to files:

  • testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py
  • testing/python/jit/test_tilelang_jit_parcompile.py
  • testing/python/language/test_tilelang_language_infinity.py
  • testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py
  • testing/python/language/test_tilelang_language_atomic_add.py
  • testing/python/issue/test_tilelang_issue_96.py
  • testing/python/language/test_tilelang_language_alloc.py
📚 Learning: 2025-09-15T10:51:06.985Z
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). However, the layout template parameter should use sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h where Crosswise uses sizeof(ElementA) but the layout template uses sizeof_bits<ElementA>::value.

Applied to files:

  • tilelang/intrinsics/mfma_layout.py
  • src/tl_templates/hip/reduce.h
📚 Learning: 2025-12-26T06:45:51.789Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1483
File: tilelang/jit/adapter/cutedsl/adapter.py:93-95
Timestamp: 2025-12-26T06:45:51.789Z
Learning: For the CuTeDSL backend in tilelang/jit/adapter/cutedsl/adapter.py, the host_kernel_source and device_kernel_source have the same value.

Applied to files:

  • testing/python/jit/test_tilelang_jit_cutedsl.py
📚 Learning: 2025-09-15T10:51:06.985Z
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h.

Applied to files:

  • src/tl_templates/hip/reduce.h
📚 Learning: 2025-12-24T17:20:32.819Z
Learnt from: clouds56
Repo: tile-ai/tilelang PR: 1527
File: tilelang/env.py:0-0
Timestamp: 2025-12-24T17:20:32.819Z
Learning: The nvidia-cuda-nvcc PyPI package installs to `nvidia/cu13/bin/` (for CUDA 13), `nvidia/cu12/bin/` (for CUDA 12), and `nvidia/cu11/bin/` (for CUDA 11) in the site-packages directory, not to `nvidia/cuda_nvcc/bin/`. These paths should be used when detecting CUDA installations from PyPI packages in tilelang/env.py.

Applied to files:

  • testing/python/jit/test_tilelang_jit_nvrtc.py
🧬 Code graph analysis (10)
tilelang/intrinsics/mfma_layout.py (1)
tilelang/language/v2/builder.py (1)
  • const (833-844)
testing/python/jit/test_tilelang_jit_tvm_ffi.py (2)
tilelang/language/v2/dtypes.py (2)
  • float32 (300-300)
  • float16 (299-299)
tilelang/language/symbolics.py (1)
  • dynamic (12-29)
src/tl_templates/cuda/reduce.h (1)
src/tl_templates/hip/reduce.h (2)
  • run (121-181)
  • run (188-259)
tilelang/intrinsics/mfma_macro_generator.py (1)
tilelang/language/v2/dtypes.py (1)
  • int8 (243-243)
testing/python/debug/test_tilelang_debug_print.py (1)
tilelang/jit/__init__.py (2)
  • compile (47-107)
  • compile (347-373)
testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py (5)
tilelang/language/v2/dtypes.py (8)
  • float16 (299-299)
  • float32 (300-300)
  • float8_e5m2 (355-355)
  • float8_e5m2fnuz (362-362)
  • float8_e4m3fnuz (348-348)
  • int8 (243-243)
  • int32 (245-245)
  • bfloat16 (397-397)
tilelang/tileop/gemm/gemm_base.py (6)
  • M (35-36)
  • N (39-40)
  • K (43-44)
  • trans_A (47-48)
  • trans_B (51-52)
  • in_dtype (55-57)
tilelang/language/kernel.py (1)
  • num_threads (221-225)
testing/python/amd/test_tilelang_test_amd.py (1)
  • run_gemm_rs (195-239)
maint/gemm_v2/correctness_evaluation.py (3)
  • run_gemm_rs (178-210)
  • run_gemm_sr (262-295)
  • run_gemm_rr (350-383)
testing/python/language/test_tilelang_language_atomic_add.py (2)
tilelang/language/atomic.py (1)
  • atomic_addx2 (230-265)
tilelang/language/proxy.py (1)
  • Tensor (232-232)
src/tl_templates/hip/reduce.h (1)
src/tl_templates/hip/common.h (13)
  • tl (110-200)
  • shfl_down (138-140)
  • shfl_down (157-161)
  • shfl_down (182-186)
  • shfl_xor (134-136)
  • shfl_xor (151-155)
  • shfl_xor (176-180)
  • shfl (146-148)
  • shfl (169-173)
  • shfl (194-198)
  • shfl_up (142-144)
  • shfl_up (163-167)
  • shfl_up (188-192)
testing/python/issue/test_tilelang_issue_96.py (2)
tilelang/jit/__init__.py (2)
  • compile (47-107)
  • compile (347-373)
tilelang/jit/kernel.py (1)
  • out_idx (609-610)
src/tl_templates/hip/common.h (2)
src/tl_templates/cuda/common.h (5)
  • tl (248-342)
  • tl (571-573)
  • tl (574-576)
  • tl (591-671)
  • bfloat16_t (581-582)
src/tl_templates/cuda/atomic.h (1)
  • half_t (24-26)
🔇 Additional comments (63)
testing/python/language/test_tilelang_language_let.py (1)

6-6: Verify CUDA-only gating aligns with ROCm CI objectives.

The PR title indicates the goal is to "Open Rocm ci test," but this decorator restricts the test to CUDA-only environments. Given that line 18 hardcodes target="cuda" and line 19 checks for CUDA-specific output ("float4 b"), this gating may be intentional if the test exercises CUDA-specific features not yet supported in ROCm.

Please confirm whether:

  1. This test is intentionally excluded from ROCm CI, or
  2. A ROCm-compatible version of this test should be added alongside this CUDA-gated one.
testing/python/issue/test_tilelang_issue_96.py (1)

40-40: LGTM! Change enables multi-backend support.

Removing the explicit target="cuda" parameter allows the compilation target to be auto-detected or controlled via the TILELANG_TARGET environment variable, which is essential for ROCm CI support. The compile function will default to "auto" when no target is specified, enabling the test to run on both CUDA and ROCm systems.

Note: The PyTorch device="cuda" on lines 43-44 is correct, as PyTorch uses the "cuda" device identifier for both NVIDIA CUDA and AMD ROCm backends.
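A minimal sketch of the auto-detection behavior described here, assuming only torch as a dependency; this is an illustration, not tilelang's actual determine_target implementation (the function name resolve_target and the torch.version.hip probe are assumptions):

import os

def resolve_target() -> str:
    forced = os.environ.get("TILELANG_TARGET")
    if forced:
        return forced
    import torch
    if torch.cuda.is_available():
        # ROCm builds of PyTorch set torch.version.hip; CUDA builds leave it None.
        return "hip" if getattr(torch.version, "hip", None) else "cuda"
    raise ValueError("No CUDA or HIP device available")

print(resolve_target())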

testing/python/language/test_tilelang_language_var_init.py (1)

6-7: Appropriate CUDA gating for unsupported HIP feature.

The decorator correctly restricts this test to CUDA environments given that variable initialization is not yet supported on HIP.

testing/python/jit/test_tilelang_jit_cutedsl.py (1)

158-158: LGTM! CUDA gating is appropriate for CuTeDSL backend tests.

The @tilelang.testing.requires_cuda decorator additions are correct for these CuTeDSL-specific tests. Since CuTeDSL is a CUDA backend and all tests invoke .cuda() operations, gating them behind CUDA availability enables ROCm CI to skip these gracefully.

Also applies to: 210-210, 253-253, 303-303

testing/python/jit/test_tilelang_jit_nvrtc.py (1)

158-158: LGTM! CUDA gating is appropriate for NVRTC backend tests.

The @tilelang.testing.requires_cuda decorator additions are correct for these NVRTC-specific tests. Since NVRTC (NVIDIA Runtime Compilation) is a CUDA-only backend and all tests invoke .cuda() operations, gating them behind CUDA availability enables ROCm CI to skip these gracefully. The Hopper-specific test at line 375 correctly uses layered gating with both requires_cuda and check_hopper().

Also applies to: 210-210, 253-253, 303-303, 375-375, 389-389

testing/python/cache/test_tilelang_kernel_cache.py (1)

121-121: Correct decorator usage for GPU-executing test.

This test appropriately has the @tilelang.testing.requires_cuda decorator since it executes kernels on GPU tensors (lines 187-188, 190-191).

testing/python/jit/test_tilelang_jit_gemm_cython.py (1)

169-169: LGTM: Accumulation dtype upgrade improves numerical stability.

Changing the accumulation dtype from T.float16 to T.float32 across all GEMM tests is a best practice that reduces accumulated rounding errors during matrix multiplication. This is especially important when using mixed-precision computation (float16 inputs/outputs with float32 accumulation) and aligns with the reference implementation which uses torch.float (float32) for computation.

The existing test tolerances (atol=1e-2, rtol=1e-2) remain appropriate and should actually see improved accuracy with this change.

Also applies to: 211-211, 255-255, 304-308, 357-357
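A short, hedged illustration of the mixed-precision pattern endorsed here (not the test's actual code): float16 operands, float32 accumulation in the reference, and the same 1e-2 tolerances. kernel_out is a placeholder for the compiled kernel's output.

import torch

M, N, K = 256, 256, 256
a = torch.randn(M, K, dtype=torch.float16)
b = torch.randn(K, N, dtype=torch.float16)

# Reference accumulates in float32 (mirroring accum_dtype=T.float32 in the kernel),
# then casts back to the float16 output dtype before comparison.
ref = (a.to(torch.float32) @ b.to(torch.float32)).to(torch.float16)

kernel_out = ref.clone()  # placeholder standing in for the tilelang kernel's output
torch.testing.assert_close(kernel_out, ref, atol=1e-2, rtol=1e-2)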

testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py (2)

199-199: LGTM! Appropriate CUDA gating for Tensor Core operations.

The decorator correctly gates this test since it relies on NVIDIA-specific INT4 Tensor Core MMA operations, ldmatrix/stmatrix intrinsics, and CUDA device tensors.


403-403: LGTM! Correct decorator stacking for multi-requirement test.

The CUDA gating decorator is properly stacked with existing bitblas package and LLVM requirements, ensuring the test only runs when all dependencies are available.

testing/python/language/test_tilelang_language_alloc.py (4)

36-36: Clarify the acceptance of two variable allocation patterns.

The assertion now accepts either "tmp =" or "tmp[0] =" in the emitted code. This suggests different code generation between backends (likely CUDA vs ROCm).

Please clarify:

  • Is "tmp[0] =" an expected difference in how ROCm/HIP represents scalar variables compared to CUDA?
  • Could this mask a potential code generation issue where scalar variables are incorrectly treated as arrays?

Consider adding a comment explaining why both patterns are valid, or verify that the backend-specific pattern is correct.

The same concern applies to line 76.


118-120: Temporary ROCm gating is appropriate.

The explicit CUDA-only gating with a TODO comment is a reasonable approach for incrementally enabling ROCm support. This allows the CI to run on CUDA while tracking tests that need ROCm enablement.

Consider tracking these TODOs in a tracking issue to ensure they're addressed systematically.


154-156: Verify consistency of kernel source retrieval across tests.

This test uses kernel.get_kernel_source(kernel_only=True) while the tests at lines 35 and 75 use kernel.get_kernel_source() without the kernel_only parameter.

Please clarify:

  • Is kernel_only=True necessary here due to the specific counting logic (counting occurrences of "= 1;" and "= 2;")?
  • Should the earlier tests (lines 35, 75) also use kernel_only=True for more precise assertions?

Consistent usage across similar tests improves maintainability.


159-161: Temporary ROCm gating is appropriate.

Same approach as the previous gated test. The consistent pattern of TODO comments and CUDA-only gating facilitates incremental ROCm enablement.

testing/python/language/test_tilelang_language_infinity.py (1)

3-3: LGTM! Essential import bug fix.

This import was missing despite the code using tilelang.testing.requires_cuda (line 24) and tilelang.testing.main() (line 33). Without this explicit import, the test would fail with an AttributeError at runtime.

testing/python/language/test_tilelang_language_vectorize.py (1)

127-127: Verify the intent behind CUDA-only gating in a ROCm enablement PR.

The decorator correctly gates this test given the hardcoded device="cuda" on line 146. However, since this PR aims to enable ROCm CI support, consider whether this test could be updated to support both CUDA and ROCm devices (e.g., parameterizing the device) rather than restricting it to CUDA-only.

If there's a technical reason this test must remain CUDA-only (e.g., certain FP8 dtypes tested here have different support on ROCm), please clarify the reasoning.

Optional: Parameterize device for ROCm support

If ROCm support is intended for this test, you could parameterize the device:

 @tilelang.testing.requires_cuda
 @pytest.mark.parametrize(
     "dtype",
     [
         torch.uint8,
         ...
     ],
 )
 @pytest.mark.parametrize("vec_num", [1, 2, 4, 8])
-def test_vectorize_all_dtypes(dtype, vec_num):
-    x = torch.empty((64,), dtype=dtype, device="cuda")
+@pytest.mark.parametrize("device", ["cuda"])  # Add "hip" when ROCm support is ready
+def test_vectorize_all_dtypes(dtype, vec_num, device):
+    x = torch.empty((64,), dtype=dtype, device=device)
     kernel = vectorize_test_all_dtypes(dtype, vec_num)
     kernel(x)

Then update the decorator to @tilelang.testing.requires_gpu (if available) instead of CUDA-specific.

testing/python/jit/test_tilelang_jit_tvm_ffi.py (4)

167-167: LGTM: Improved numerical stability with float32 accumulation.

Switching from float16 to float32 accumulation while keeping float16 input/output is a standard mixed-precision pattern that improves numerical accuracy without significantly impacting memory usage.


210-210: LGTM: Consistent accumulation dtype across test suite.

The float32 accumulation dtype change is consistently applied across benchmarking and multi-stream tests, ensuring uniform numerical behavior.

Also applies to: 252-252


301-301: LGTM: Float32 accumulation applied to dynamic shape tests.

The accumulation dtype change is correctly applied across all dynamic shape test variants, maintaining consistency with static shape tests.

Also applies to: 303-303, 306-306


386-386: LGTM: Correct CUDA-only gating for L2 persistent cache test.

The decorator is appropriate since the test validates CUDA-specific L2 persistent cache APIs (__tvm_cuda_stream_set_access_policy_window_packed). This feature is not available on ROCm.

testing/python/carver/test_tilelang_carver_recommend_hints.py (1)

136-136: Consider whether the @tilelang.testing.requires_cuda decorator appropriately gates FMHA tests given the ROCm enablement goals.

The FlashAttentionTemplate implementation uses architecture-agnostic TVM APIs (get_tensorized_func_and_tags with self.arch.target) and the test calls auto_infer_current_arch(), which should work on both CUDA and ROCm. However, this test is CUDA-gated while similar templates in the same file—MatmulTemplate, GEMVTemplate, ElementwiseTemplate, and GeneralReductionTemplate—are not, making the restriction inconsistent.

Verify whether:

  1. FlashAttentionTemplate has runtime or compiler-level ROCm limitations not visible in the template code
  2. The decorator should be removed or replaced with separate ROCm and CUDA tests following the pattern shown in test_tilelang_tilelibrary_gemm.py
testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py (6)

114-118: LGTM! Float32 accumulation improves numerical stability.

The addition of T.float32 as dtypeAccum for float16 tests follows best practices for preventing overflow and maintaining precision during accumulation.


271-275: LGTM! Consistent accumulation dtype usage.

The float32 accumulation for float16 inputs is correctly applied to the RS variant tests.


427-431: LGTM! Float32 accumulation correctly applied.


448-456: Verify SR FP8 stability on CUDA.

Unlike the SS and RS variants (which are skipped), this SR FP8 CUDA test is enabled. Confirm whether the SR variant is stable enough for FP8 on CUDA, or whether it should also be skipped until the GEMM issues are resolved.


590-593: LGTM! Float32 accumulation correctly applied.


612-620: Verify RR FP8 stability on CUDA.

Like the SR variant, this RR FP8 CUDA test is enabled while SS and RS are skipped. Confirm that the RR variant has sufficient stability for FP8 on CUDA.

testing/python/language/test_tilelang_language_frontend_v2.py (1)

322-323: LGTM! Appropriate workaround for ROCm limitation.

Gating this test behind CUDA availability is the right approach given that ROCm doesn't currently support alloc_var with initializer (line 329 uses T.alloc_var(T.int32, 0)). The TODO comment clearly documents the limitation. Consider tracking this as a formal issue to ensure ROCm support for alloc_var with initializer is added in the future.

src/op/logical.cc (2)

53-54: LGTM! Consistent HIP support addition.

The HIP intrinsic registration follows the same pattern as tl.any_of above, correctly reusing the lowering function for both CUDA and HIP backends. This maintains consistency across the logical operations.


45-46: HIP runtime implementations confirmed to exist.

The registration correctly adds HIP support alongside CUDA, reusing the same lowering function. Both tl::Any and tl::All are properly implemented as device templates in src/tl_templates/hip/common.h, ensuring external symbol resolution at link time.

testing/python/jit/test_tilelang_jit_callback.py (4)

4-4: LGTM!

The import correctly extends callback support to include HIP alongside CUDA.


93-96: LGTM! HIP callback mirrors the CUDA pattern.

The implementation correctly follows the CUDA postproc callback structure. Note that like the CUDA callback, this registration will persist globally after the function executes.

Based on learnings, the persistent callback behavior is intentional.


232-232: Verify the accumulation dtype change in the active test.

This test actively runs and validates numerical correctness. The change from T.float16 to T.float32 for accumulation improves numerical precision but could indicate:

  1. A bug fix for numerical accuracy issues
  2. A HIP-specific requirement
  3. A general best practice change

Please confirm:

  • Is this change necessary for HIP support, or is it a general fix?
  • Have you verified the test passes with this change on both CUDA and HIP (if applicable)?
  • Does this align with the "fix some bugs" mentioned in the PR title?

117-117: No action needed — accumulation dtype change aligns with codebase patterns.

The change to T.float32 for accumulation dtype is consistent with standard mixed-precision design used throughout the test suite (float16 inputs with float32 accumulation for numerical stability). This pattern is used in all similar tests and applies to both CUDA and HIP paths equally. Since the test is skipped, this change has no current impact.

testing/python/jit/test_tilelang_jit_parcompile.py (1)

61-62: The block_K configurations are intentional and mathematically sound. Both configurations maintain constant shared memory usage (~32KB):

  • Config 1 (1024³): block_K=64 → A_shared(128×64) + B_shared(64×128) = 32KB
  • Config 2 (2048³): block_K=32 → A_shared(256×32) + B_shared(32×256) = 32KB

As matrix dimensions double, block_K is halved to maintain the shared memory constraint—a critical optimization for ROCm devices. The commit message "Fix some bugs on ci and open rocm ci test" confirms this is a deliberate bugfix, not an accidental swap. No changes needed.
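A quick worked check of the arithmetic above, assuming 2-byte (float16) elements for both shared buffers:

def smem_bytes(block_m: int, block_n: int, block_k: int, elem_bytes: int = 2) -> int:
    # A_shared is (block_M x block_K), B_shared is (block_K x block_N)
    return (block_m * block_k + block_k * block_n) * elem_bytes

print(smem_bytes(128, 128, 64) // 1024, "KB")  # Config 1 (1024^3): 32 KB
print(smem_bytes(256, 256, 32) // 1024, "KB")  # Config 2 (2048^3): 32 KB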

testing/python/language/test_tilelang_language_atomic_add.py (3)

238-240: LGTM!

The @tilelang.testing.requires_cuda decorator appropriately gates this CUDA-specific atomic operation test.


243-245: LGTM!

The @tilelang.testing.requires_cuda decorator correctly gates this CUDA-specific test for atomic operations with memory ordering.


353-357: LGTM!

The @tilelang.testing.requires_cuda decorator properly gates this test, and the addition of multiple dtype tests (float32, float16, bfloat16) improves test coverage for atomic operations with different memory orderings.

testing/python/language/test_tilelang_language_ptr.py (1)

44-44: The code is correct and does not require changes. The cython execution backend is explicitly supported for CUDA targets (verified in tilelang/jit/execution_backend.py::allowed_backends_for_target() where cuda allows ["tvm_ffi", "nvrtc", "cython"]). Target autodetection with determine_target("auto") explicitly detects CUDA availability via torch.cuda.is_available() and returns the appropriate CUDA target with device capability. The CythonKernelAdapter is documented as converting TIR functions to compiled CUDA libraries.

Likely an incorrect or invalid review comment.

testing/python/debug/test_tilelang_debug_print.py (3)

2-2: LGTM!

The pytest import is necessary for the parameterized test decorator introduced later in the file.


20-24: LGTM!

The parameterized test refactoring consolidates multiple dtype test cases into a single, maintainable test. Platform-specific FP8 types are appropriately tested separately with conditional decorators.


50-50: No actionable changes needed; tests properly support platform-agnostic GPU compilation.

The removal of explicit target="cuda" correctly enables ROCm CI testing. The tests contain GPU-specific operations (T.Kernel, T.alloc_shared, T.alloc_fragment) and rely on the default target="auto", which uses determine_target() to automatically select CUDA or HIP based on available hardware. Importantly, if neither CUDA nor HIP is available, the function raises ValueError—tests cannot silently fall back to CPU execution. The change properly aligns with enabling cross-platform GPU testing without requiring conditional decorators.

src/tl_templates/hip/hip_fp8.h (1)

1-1: LGTM! Include guard added.

The addition of #pragma once follows standard practice for header files and prevents multiple inclusion issues.

tilelang/jit/adapter/wrapper.py (3)

314-316: Good refactor to extract declaration parsing logic.

Moving the declaration extraction into a dedicated method improves code organization and maintainability.


612-615: Appropriate HIP-specific declaration parsing.

The HIP implementation correctly splits on "{" instead of ";" to handle HIP's kernel declaration syntax, which goes directly from signature to function body without a semicolon-terminated forward declaration.
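A stand-alone sketch of the parsing difference being described (illustrative names and sources, not the actual wrapper.py code): the CUDA source carries a semicolon-terminated forward declaration, while the HIP source goes straight from the signature into the body, so the signature is recovered by splitting on "{" instead.

cuda_src = ('extern "C" __global__ void main_kernel(float* A, float* B);\n'
            'extern "C" __global__ void main_kernel(float* A, float* B) { /* body */ }')
hip_src = 'extern "C" __global__ void main_kernel(float* A, float* B) { /* body */ }'

def declaration_from_cuda(source: str) -> str:
    # Forward declaration ends at the first ';'
    return source.split(";")[0].strip()

def declaration_from_hip(source: str) -> str:
    # No forward declaration: take everything before the function body
    return source.split("{")[0].strip()

assert declaration_from_cuda(cuda_src) == declaration_from_hip(hip_src)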


585-585: Both type mappings are actively used and necessary.

The additions "float8_e5m2fnuz": "fp8_e5_t" and "uint64": "uint64_t" are legitimately needed for HIP kernel generation:

  • float8_e5m2fnuz is a first-class dtype with extensive kernel support (GEMM kernels, MMA/MFMA intrinsic generators, test coverage with actual computations)
  • uint64 is used during kernel argument generation for tensor descriptor dimensions and strides, which are always 64-bit in GPU addressing

Both types are looked up via _lookup_type() when building kernel function signatures.

src/tl_templates/cuda/reduce.h (3)

185-186: Good refactor to parameterize segment size.

Replacing hardcoded 32 with the SEG template parameter improves flexibility and aligns with the HIP implementation.


190-191: Good addition of boundary check.

The early exit if (gRow >= H) return; prevents out-of-bounds memory access when the global row index exceeds the height dimension.


178-179: The function signature at lines 178-179 shows void as the return type, which is consistent with CumSum1D (line 110) and the HIP implementation (line 188-189). This is not a breaking change—the function has always returned void, writing results to the dst parameter. The call site in src/op/reduce.cc:559 correctly handles this by passing dstPtr as a parameter and wrapping the result in Evaluate(), which is the correct pattern for void external function calls. No action is required.

Likely an incorrect or invalid review comment.

tilelang/intrinsics/mfma_layout.py (1)

3-3: No changes needed—const(0) usage is correct.

The change from convert(0) to const(0) on lines 9 and 20 is semantically appropriate. These layout functions return TIR expressions, and tvm.tir.const() is the correct API for creating compile-time integer constants in that context. The remaining convert() call at line 38 correctly uses convert() for its intended purpose of converting a Python list to a TVM runtime object. No other convert(0) patterns exist in the intrinsics code.
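For reference, a tiny example of the API the comment refers to (requires a TVM installation; the dtype string is illustrative):

import tvm

zero = tvm.tir.const(0, "int32")
print(type(zero), zero)  # an IntImm, i.e. a compile-time TIR constant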

tilelang/intrinsics/mfma_macro_generator.py (3)

53-53: LGTM! FP8 e5m2fnuz abbreviation mapping is consistent.

The addition of "float8_e5m2fnuz": "e5m2fnuz" follows the same pattern as the existing "float8_e4m3fnuz" mapping.


111-111: LGTM! FP8 e5m2fnuz k_dim handling is correct.

The new check correctly treats "float8_e5m2fnuz" the same as "float8_e4m3fnuz" and T.int8, setting k_dim = 32 as expected for 8-bit types in MFMA operations.


145-145: LGTM! FP8 e5m2fnuz MFMA prefix mapping is consistent.

The mapping of "float8_e5m2fnuz": "fp8" correctly groups it with "float8_e4m3fnuz" for MFMA suffix generation, ensuring both FP8 variants use the same fp8_fp8 suffix pattern.

src/target/codegen_hip.cc (2)

945-946: LGTM! FP8 e5m2fnuz type mappings are consistent.

The new mappings for float8_e5m2fnuzx4 and float8_e5m2fnuzx8 correctly follow the same pattern as the existing e4m3fnuz mappings (lines 943-944), ensuring uniform FP8 type handling in MFMA operations.


985-994: LGTM! Warp reduction operations are correctly integrated.

The new warp reduction operations properly delegate to the corresponding tl::warp_reduce_* functions defined in src/tl_templates/hip/reduce.h (lines 273-291), following the established pattern for HIP builtin operations.

src/tl_templates/hip/common.h (4)

3-5: LGTM! Required headers added for HIP warp and atomic operations.

The new includes for atomic.h and hip/amd_detail/amd_warp_functions.h provide the necessary infrastructure for the warp shuffle and atomic operations introduced in the tl namespace.


110-130: LGTM! Any and All helper functions are correctly implemented.

The tl::Any and tl::All helper functions provide simple, clear array predicates that will be useful for reduction and warp-level logic.


134-148: LGTM! Generic shfl wrappers correctly delegate to HIP intrinsics.

The generic tl::shfl_xor, tl::shfl_down, tl::shfl_up, and tl::shfl wrappers properly delegate to the corresponding HIP intrinsics. The TODO comment appropriately notes that HIP/ROCm does not yet provide shfl_sync variants (they only become available in ROCm 7.1.1+).


151-198: LGTM! FP16/BF16 shfl specializations correctly handle type conversions.

The specializations for half_t (lines 151-173) and bfloat16_t (lines 176-198) correctly convert to float before shuffling and convert back afterward. This matches the pattern used in the CUDA version and ensures correct behavior for these types that don't have native HIP shuffle support.

src/tl_templates/hip/reduce.h (5)

76-76: LGTM! Correctly uses type-safe tl::shfl_down wrapper.

The change from __shfl_down to tl::shfl_down ensures correct behavior for half_t and bfloat16_t types by using the specializations defined in common.h.


107-107: LGTM! Correctly uses type-safe tl::shfl_xor wrapper.

The change from __shfl_xor to tl::shfl_xor ensures correct behavior for half_t and bfloat16_t types by using the specializations defined in common.h.


117-182: LGTM! CumSum1D implementation correctly uses tl::shfl wrappers.

The CumSum1D implementation properly uses tl::shfl_down (line 142), tl::shfl_up (line 165), and tl::shfl (lines 152, 155, 175, 178) throughout, ensuring type safety for all supported types including half_t and bfloat16_t.


184-260: LGTM! CumSum2D implementation correctly uses tl::shfl wrappers.

The CumSum2D implementation properly uses tl::shfl_down (line 216), tl::shfl_up (line 242), and tl::shfl (lines 226, 229, 252, 255) throughout, ensuring type safety for all supported types including half_t and bfloat16_t.


273-291: LGTM! Warp reduction wrappers are correctly implemented.

The warp_reduce_sum, warp_reduce_max, warp_reduce_min, warp_reduce_bitand, and warp_reduce_bitor functions properly delegate to warp_reduce with the appropriate operation. Once the warp_reduce function is fixed to use tl::shfl_xor, these wrappers will work correctly for all types.

Comment on lines +214 to +215
A = torch.randn(M, N, dtype=torch.float32).cuda().to(getattr(torch, dtype))
B = torch.zeros(M, N, dtype=torch.float32).cuda().to(getattr(torch, dtype))
Contributor

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check tensor initialization patterns in the test file
echo "=== Tensor initialization patterns in test_tilelang_language_atomic_add.py ==="
rg -n -B1 -A2 "torch\.(randn|zeros|arange)" testing/python/language/test_tilelang_language_atomic_add.py | head -80

Repository: tile-ai/tilelang

Length of output: 2400


🏁 Script executed:

#!/bin/bash
# Identify which functions contain the anomalous initialization patterns
echo "=== Function containing lines 214-215 (two-step .to() conversion) ==="
sed -n '200,220p' testing/python/language/test_tilelang_language_atomic_add.py

echo -e "\n=== Function containing lines 309-310 (float32 only, no conversion) ==="
sed -n '300,320p' testing/python/language/test_tilelang_language_atomic_add.py

Repository: tile-ai/tilelang

Length of output: 1627


Align tensor initialization with the standard pattern used elsewhere in the file.

The tensor initialization in lines 214-215 uses a two-step approach (float32 → conversion), but this is inconsistent with the standard pattern throughout the file. The run_atomic_addx2 function accepts a dtype parameter yet doesn't use it directly. All other test functions with dtype parameters (e.g., run_atomic_add, run_atomic_add_builtin) create tensors directly with the target dtype: torch.randn(..., dtype=getattr(torch, dtype)).cuda(). Update lines 214-215 to follow this consistent approach, or add a comment explaining why the two-step conversion is necessary.

🤖 Prompt for AI Agents
In testing/python/language/test_tilelang_language_atomic_add.py around lines
214-215, the tensors A and B are created as float32 then converted to the target
dtype, which is inconsistent with the rest of the file; change the
initialization to construct the tensors directly with the target dtype (e.g.,
torch.randn(M, N, dtype=getattr(torch, dtype)).cuda() and torch.zeros(M, N,
dtype=getattr(torch, dtype)).cuda()), or if there is a deliberate reason to
create as float32 first, add a brief comment explaining why the two-step
conversion is necessary.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
testing/python/language/test_tilelang_language_annot.py (1)

25-27: Tests hardcode "cuda" device despite HIP workaround.

The workaround at lines 8-10 is specifically for HIP/ROCm, but the test hardcodes device="cuda". This creates two problems:

  1. The test will fail on ROCm-only systems where the "cuda" device doesn't exist
  2. The HIP-specific workaround won't actually be validated on HIP hardware

Since this PR aims to "Open Rocm ci test," consider parameterizing the device or using a helper that selects the appropriate device (cuda/hip) at runtime.

🔎 Example fix using device detection
+    # Detect available device
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    
     ker = example_tensor_annot()
-    A = torch.arange(16, dtype=torch.int32, device="cuda")
+    A = torch.arange(16, dtype=torch.int32, device=device)
     ker(A)
-    expected = torch.zeros(16, dtype=torch.int32, device="cuda")
+    expected = torch.zeros(16, dtype=torch.int32, device=device)

Note: PyTorch on ROCm typically exposes devices through torch.cuda API, so this should work. Verify that the test framework or tilelang provides a device selector if needed.

♻️ Duplicate comments (11)
testing/python/language/test_tilelang_language_annot.py (2)

32-34: Same issues as test_tensor_annot_mul.

This test has the same problems identified in the first test:

  1. Missing issue reference in the comment (lines 32-33)
  2. Hardcoded "cuda" device (lines 49, 51) despite the HIP-specific workaround

Please apply the same fixes as suggested for test_tensor_annot_mul.

Also applies to: 49-51


56-58: Same issues as previous tests.

This test has the same problems:

  1. Missing issue reference in the comment (lines 56-57)
  2. Hardcoded "cuda" device (lines 73, 75) despite the HIP-specific workaround

Please apply the same fixes as suggested for the other test functions.

Also applies to: 73-75

testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py (3)

133-156: Skip reason lacks issue tracking.

The skip decorators use generic reasons without referencing tracking issues for the known GEMM SS problems. This was already flagged in the previous review.


290-313: Skip reason lacks issue tracking.

The skip decorators use generic reasons without referencing tracking issues. This was already flagged in the previous review.


612-634: Document the intentional FP8 test gating strategy and create tracking for precision issues.

The FP8 test gating is intentional: SS and RS variants are explicitly skipped with reasons ("Temporarily disabling until GEMM SS/RS issues are resolved"), while SR and RR variants are enabled. This pattern should be documented in a test strategy comment or PR description to clarify that the inconsistency is by design.

Additionally, the TODO comments referencing float8_e5m2fnuz precision problems (in both SR and RR ROCm tests) lack issue tracking. Create GitHub issues or add issue numbers to these comments for better visibility and tracking.

testing/python/cache/test_tilelang_kernel_cache.py (2)

199-201: Remove CUDA gating from compilation-only test.

This test only compiles kernels without executing them (lines 222-227, 240-245). Given the PR objective to "Open Rocm ci test," this decorator incorrectly skips the test on ROCm CI. Remove @tilelang.testing.requires_cuda to allow cache behavior testing across all platforms.

Based on previous review: compilation-only tests should run on all platforms to maximize CI coverage.


250-252: Remove CUDA gating from compilation-only test.

This test only compiles a kernel (lines 274-279) without GPU execution. The @tilelang.testing.requires_cuda decorator unnecessarily restricts ROCm CI coverage. Remove it to align with the PR goal of opening ROCm CI tests.

Based on previous review: compilation-only tests should run on all platforms.

testing/python/language/test_tilelang_language_atomic_add.py (2)

214-215: Align tensor initialization with the standard pattern used elsewhere in the file.

The tensor initialization uses a two-step approach (float32 → conversion), which is inconsistent with the standard pattern throughout the file. All other test functions create tensors directly with the target dtype: torch.randn(..., dtype=getattr(torch, dtype)).cuda().

🔎 Proposed fix
-    A = torch.randn(M, N, dtype=torch.float32).cuda().to(getattr(torch, dtype))
-    B = torch.zeros(M, N, dtype=torch.float32).cuda().to(getattr(torch, dtype))
+    A = torch.randn(M, N, dtype=getattr(torch, dtype)).cuda()
+    B = torch.zeros(M, N, dtype=getattr(torch, dtype)).cuda()

253-254: Add missing @requires_cuda decorator for consistency.

test_atomic_addx2_float is missing the @tilelang.testing.requires_cuda decorator, while test_atomic_addx2_half (Line 248) has it. Both tests use the same underlying atomic_addx2 operation, which is CUDA-specific. This inconsistency will cause test_atomic_addx2_float to fail on non-CUDA platforms.

🔎 Proposed fix
+@tilelang.testing.requires_cuda
 def test_atomic_addx2_float():
     run_atomic_addx2(32, 64, 8, 16, dtype=T.float32)
src/tl_templates/hip/reduce.h (1)

262-271: Use tl::shfl_xor instead of raw __shfl_xor for type safety.

The warp_reduce function uses raw HIP intrinsics (__shfl_xor) which will produce incorrect results or fail for half_t and bfloat16_t types. These types require the tl::shfl_xor wrapper (defined in common.h lines 148-177) that converts to float before shuffling.

This is inconsistent with the rest of the file where tl::shfl_* wrappers are correctly used (e.g., lines 76, 107, 142, 165, etc.).

🔎 Proposed fix
 template <typename T, typename ReduceOp>
 TL_DEVICE T warp_reduce(T value, ReduceOp op) {
-  value = op(value, __shfl_xor(value, 32));
-  value = op(value, __shfl_xor(value, 16));
-  value = op(value, __shfl_xor(value, 8));
-  value = op(value, __shfl_xor(value, 4));
-  value = op(value, __shfl_xor(value, 2));
-  value = op(value, __shfl_xor(value, 1));
+  value = op(value, tl::shfl_xor(value, 32));
+  value = op(value, tl::shfl_xor(value, 16));
+  value = op(value, tl::shfl_xor(value, 8));
+  value = op(value, tl::shfl_xor(value, 4));
+  value = op(value, tl::shfl_xor(value, 2));
+  value = op(value, tl::shfl_xor(value, 1));
   return value;
 }
src/tl_templates/hip/debug.h (1)

40-49: Missing unsigned long long specialization.

The trait defines long long but not unsigned long long. This could cause the fallback template to be used for uint64_t values, printing them as pointers instead of their actual values.

🔎 Proposed fix
 DEFINE_PRINT_TRAIT(long long, "long long", "%lld", long long);
+DEFINE_PRINT_TRAIT(unsigned long long, "unsigned long long", "%llu", unsigned long long);
🧹 Nitpick comments (5)
testing/python/language/test_tilelang_language_ptr.py (1)

61-62: Consider adding CUDA availability check.

The test creates CUDA tensors but lacks a CUDA availability decorator (e.g., @tilelang.testing.requires_cuda). The AI summary indicates other test files in this PR are receiving similar gating decorators for backend availability. Adding this decorator would prevent test failures when CUDA is unavailable.

🔎 Suggested decorator
+@tilelang.testing.requires_cuda
 def test_matmul():
     run_matmul(1024, 1024, 1024, 128, 128, 32)
testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py (2)

432-441: Document precision problems with issue tracking.

The TODO comments about ROCm precision problems should be tracked in dedicated issues. Commented-out tests indicate known instabilities that need formal tracking for resolution and re-enablement.

Consider creating issues for:

  1. ROCm int8 precision problem (lines 432-434)
  2. ROCm float precision problem (lines 439-441)

595-597: Document ROCm precision problems with issue tracking.

The TODO comment about ROCm precision problems with num_stages=2 should be tracked in a dedicated issue for proper resolution planning.

tilelang/engine/lower.py (2)

73-74: Consider validating include paths before use.

The direct concatenation of include paths could produce invalid compiler flags if the constants are None or empty strings. Consider adding validation to ensure paths are valid before passing them to the compiler.

🔎 Suggested validation approach
+    # Validate required include paths
+    if not TILELANG_TEMPLATE_PATH or not CUTLASS_INCLUDE_DIR:
+        raise ValueError("Required CUDA include paths not configured. Ensure TILELANG_TEMPLATE_PATH and CUTLASS_INCLUDE_DIR are set.")
+
     options = [
         "-std=c++17",
         "-I" + TILELANG_TEMPLATE_PATH,
         "-I" + CUTLASS_INCLUDE_DIR,
     ]

120-121: Consider validating include paths before use.

Similar to the CUDA path, validate that HIP include paths are properly configured before passing them to the compiler.

🔎 Suggested validation approach
+    # Validate required include paths
+    if not TILELANG_TEMPLATE_PATH or not COMPOSABLE_KERNEL_INCLUDE_DIR:
+        raise ValueError("Required HIP include paths not configured. Ensure TILELANG_TEMPLATE_PATH and COMPOSABLE_KERNEL_INCLUDE_DIR are set.")
+
     hsaco = hipcc.compile_hip(
         code,
         target_format="hsaco",
         options=[
             "-std=c++17",
             "-I" + TILELANG_TEMPLATE_PATH,
             "-I" + COMPOSABLE_KERNEL_INCLUDE_DIR,
         ],
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b25c7dc and 16c4dc4.

📒 Files selected for processing (54)
  • .github/workflows/ci.yml
  • src/op/gemm.cc
  • src/op/logical.cc
  • src/target/codegen_hip.cc
  • src/tl_templates/cuda/reduce.h
  • src/tl_templates/hip/atomic.h
  • src/tl_templates/hip/common.h
  • src/tl_templates/hip/debug.h
  • src/tl_templates/hip/hip_fp8.h
  • src/tl_templates/hip/reduce.h
  • testing/python/autotune/test_tilelang_autotune.py
  • testing/python/cache/test_tilelang_kernel_cache.py
  • testing/python/carver/test_tilelang_carver_cuda_driver_properties.py
  • testing/python/carver/test_tilelang_carver_recommend_hints.py
  • testing/python/components/test_storage_rewrite_detect_inplace.py
  • testing/python/debug/test_device_assert.py
  • testing/python/debug/test_tilelang_debug_print.py
  • testing/python/issue/test_tilelang_issue_1001.py
  • testing/python/issue/test_tilelang_issue_1008.py
  • testing/python/issue/test_tilelang_issue_830.py
  • testing/python/issue/test_tilelang_issue_96.py
  • testing/python/jit/test_tilelang_jit_callback.py
  • testing/python/jit/test_tilelang_jit_cutedsl.py
  • testing/python/jit/test_tilelang_jit_gemm.py
  • testing/python/jit/test_tilelang_jit_gemm_cython.py
  • testing/python/jit/test_tilelang_jit_nvrtc.py
  • testing/python/jit/test_tilelang_jit_parcompile.py
  • testing/python/jit/test_tilelang_jit_tvm_ffi.py
  • testing/python/kernel/test_tilelang_kernel_gemm.py
  • testing/python/kernel/test_tilelang_kernel_gemm_simt.py
  • testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py
  • testing/python/language/test_tilelang_language_alias.py
  • testing/python/language/test_tilelang_language_alloc.py
  • testing/python/language/test_tilelang_language_annot.py
  • testing/python/language/test_tilelang_language_annotate_safe_value.py
  • testing/python/language/test_tilelang_language_atomic_add.py
  • testing/python/language/test_tilelang_language_clear.py
  • testing/python/language/test_tilelang_language_composable_index.py
  • testing/python/language/test_tilelang_language_copy.py
  • testing/python/language/test_tilelang_language_frontend_v2.py
  • testing/python/language/test_tilelang_language_infinity.py
  • testing/python/language/test_tilelang_language_let.py
  • testing/python/language/test_tilelang_language_mask_op.py
  • testing/python/language/test_tilelang_language_ptr.py
  • testing/python/language/test_tilelang_language_unroll.py
  • testing/python/language/test_tilelang_language_var_init.py
  • testing/python/language/test_tilelang_language_vectorize.py
  • testing/python/language/test_tilelang_language_vectorized_cast.py
  • testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py
  • testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py
  • tilelang/engine/lower.py
  • tilelang/intrinsics/mfma_layout.py
  • tilelang/intrinsics/mfma_macro_generator.py
  • tilelang/jit/adapter/wrapper.py
💤 Files with no reviewable changes (3)
  • testing/python/issue/test_tilelang_issue_830.py
  • testing/python/language/test_tilelang_language_copy.py
  • testing/python/language/test_tilelang_language_composable_index.py
✅ Files skipped from review due to trivial changes (1)
  • src/tl_templates/hip/hip_fp8.h
🚧 Files skipped from review as they are similar to previous changes (21)
  • testing/python/language/test_tilelang_language_frontend_v2.py
  • tilelang/intrinsics/mfma_macro_generator.py
  • testing/python/language/test_tilelang_language_clear.py
  • testing/python/issue/test_tilelang_issue_1008.py
  • testing/python/debug/test_device_assert.py
  • testing/python/language/test_tilelang_language_infinity.py
  • .github/workflows/ci.yml
  • testing/python/jit/test_tilelang_jit_cutedsl.py
  • testing/python/carver/test_tilelang_carver_recommend_hints.py
  • testing/python/carver/test_tilelang_carver_cuda_driver_properties.py
  • tilelang/intrinsics/mfma_layout.py
  • src/op/gemm.cc
  • testing/python/autotune/test_tilelang_autotune.py
  • testing/python/components/test_storage_rewrite_detect_inplace.py
  • testing/python/language/test_tilelang_language_annotate_safe_value.py
  • testing/python/jit/test_tilelang_jit_nvrtc.py
  • testing/python/issue/test_tilelang_issue_96.py
  • testing/python/language/test_tilelang_language_vectorize.py
  • testing/python/issue/test_tilelang_issue_1001.py
  • testing/python/jit/test_tilelang_jit_parcompile.py
  • testing/python/jit/test_tilelang_jit_callback.py
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.

Applied to files:

  • testing/python/language/test_tilelang_language_mask_op.py
  • testing/python/jit/test_tilelang_jit_gemm_cython.py
  • testing/python/kernel/test_tilelang_kernel_gemm.py
  • tilelang/engine/lower.py
  • testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py
  • testing/python/language/test_tilelang_language_alloc.py
  • testing/python/kernel/test_tilelang_kernel_gemm_simt.py
  • testing/python/language/test_tilelang_language_let.py
  • testing/python/language/test_tilelang_language_vectorized_cast.py
  • testing/python/jit/test_tilelang_jit_tvm_ffi.py
  • testing/python/cache/test_tilelang_kernel_cache.py
  • testing/python/language/test_tilelang_language_var_init.py
  • testing/python/language/test_tilelang_language_annot.py
  • testing/python/language/test_tilelang_language_unroll.py
  • testing/python/jit/test_tilelang_jit_gemm.py
  • testing/python/language/test_tilelang_language_alias.py
  • testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py
  • testing/python/language/test_tilelang_language_ptr.py
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.

Applied to files:

  • testing/python/language/test_tilelang_language_mask_op.py
  • testing/python/kernel/test_tilelang_kernel_gemm.py
  • testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py
  • testing/python/language/test_tilelang_language_alloc.py
  • testing/python/kernel/test_tilelang_kernel_gemm_simt.py
  • testing/python/language/test_tilelang_language_unroll.py
  • testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py
  • testing/python/language/test_tilelang_language_alias.py
  • testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py
  • testing/python/language/test_tilelang_language_atomic_add.py
📚 Learning: 2025-12-24T17:20:32.819Z
Learnt from: clouds56
Repo: tile-ai/tilelang PR: 1527
File: tilelang/env.py:0-0
Timestamp: 2025-12-24T17:20:32.819Z
Learning: The nvidia-cuda-nvcc PyPI package installs to `nvidia/cu13/bin/` (for CUDA 13), `nvidia/cu12/bin/` (for CUDA 12), and `nvidia/cu11/bin/` (for CUDA 11) in the site-packages directory, not to `nvidia/cuda_nvcc/bin/`. These paths should be used when detecting CUDA installations from PyPI packages in tilelang/env.py.

Applied to files:

  • tilelang/engine/lower.py
📚 Learning: 2025-09-15T10:51:06.985Z
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h.

Applied to files:

  • src/tl_templates/hip/reduce.h
📚 Learning: 2025-09-15T10:51:06.985Z
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). However, the layout template parameter should use sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h where Crosswise uses sizeof(ElementA) but the layout template uses sizeof_bits<ElementA>::value.

Applied to files:

  • src/tl_templates/hip/reduce.h
🧬 Code graph analysis (13)
testing/python/language/test_tilelang_language_mask_op.py (2)
tilelang/jit/__init__.py (2)
  • compile (47-107)
  • compile (347-373)
tilelang/jit/kernel.py (1)
  • out_idx (609-610)
testing/python/jit/test_tilelang_jit_gemm_cython.py (2)
tilelang/language/v2/dtypes.py (2)
  • float32 (300-300)
  • float16 (299-299)
tilelang/language/symbolics.py (1)
  • dynamic (12-29)
testing/python/debug/test_tilelang_debug_print.py (1)
tilelang/jit/__init__.py (2)
  • compile (47-107)
  • compile (347-373)
src/tl_templates/cuda/reduce.h (1)
src/tl_templates/hip/reduce.h (2)
  • run (121-181)
  • run (188-259)
testing/python/kernel/test_tilelang_kernel_gemm_simt.py (1)
testing/python/kernel/test_tilelang_kernel_fp8_gemm_mma.py (1)
  • assert_tl_matmul_correctness (183-218)
testing/python/jit/test_tilelang_jit_tvm_ffi.py (2)
tilelang/language/v2/dtypes.py (2)
  • float32 (300-300)
  • float16 (299-299)
tilelang/language/symbolics.py (1)
  • dynamic (12-29)
src/tl_templates/hip/reduce.h (1)
src/tl_templates/hip/common.h (13)
  • tl (110-200)
  • shfl_down (138-140)
  • shfl_down (157-161)
  • shfl_down (182-186)
  • shfl_xor (134-136)
  • shfl_xor (151-155)
  • shfl_xor (176-180)
  • shfl (146-148)
  • shfl (169-173)
  • shfl (194-198)
  • shfl_up (142-144)
  • shfl_up (163-167)
  • shfl_up (188-192)
tilelang/jit/adapter/wrapper.py (1)
tilelang/layout/layout.py (1)
  • index (46-55)
src/tl_templates/hip/common.h (2)
src/tl_templates/cuda/common.h (5)
  • tl (248-342)
  • tl (571-573)
  • tl (574-576)
  • tl (591-671)
  • bfloat16_t (581-582)
src/tl_templates/cuda/atomic.h (1)
  • half_t (24-26)
src/tl_templates/hip/debug.h (5)
src/tl_templates/hip/gemm.h (3)
  • void (158-232)
  • void (234-291)
  • bfloat16_t (33-52)
tilelang/language/print_op.py (1)
  • print_var (15-25)
tilelang/layout/layout.py (1)
  • index (46-55)
src/tl_templates/cuda/cuda_fp4.h (1)
  • float (195-197)
src/tl_templates/cuda/reduce.h (2)
  • half_t (17-19)
  • bfloat16_t (20-22)
testing/python/jit/test_tilelang_jit_gemm.py (1)
tilelang/language/v2/dtypes.py (1)
  • float32 (300-300)
testing/python/language/test_tilelang_language_alias.py (1)
tilelang/jit/__init__.py (2)
  • compile (47-107)
  • compile (347-373)
testing/python/language/test_tilelang_language_atomic_add.py (2)
tilelang/language/v2/dtypes.py (2)
  • float16 (299-299)
  • float32 (300-300)
tilelang/language/atomic.py (1)
  • atomic_addx2 (230-265)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
🔇 Additional comments (71)
testing/python/jit/test_tilelang_jit_tvm_ffi.py (2)

167-167: LGTM: Accumulation dtype upgrade improves numerical stability.

The consistent change from T.float16 to T.float32 for accumulation dtype across all matmul tests follows best practices for mixed-precision kernels. Using float32 accumulation with float16 inputs/outputs prevents precision loss and overflow during iterative accumulation operations, which is particularly important for matrix multiplications.

Also applies to: 210-210, 252-252, 301-301, 303-303, 306-306
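A host-side sketch of the failure mode that float32 accumulation avoids (not tied to tilelang; assumes a compiler with _Float16 support, such as recent clang or gcc on x86-64):

#include <cstdio>

int main() {
  _Float16 acc_f16 = (_Float16)0.0f;
  float acc_f32 = 0.0f;
  _Float16 one = (_Float16)1.0f;
  for (int i = 0; i < 4096; ++i) {
    // fp16 stalls at 2048: 2048 + 1 rounds back to 2048 (11-bit significand).
    acc_f16 = (_Float16)(acc_f16 + one);
    acc_f32 += 1.0f; // exact in fp32 for this range
  }
  std::printf("fp16 accumulator: %f\n", (double)acc_f16); // 2048.000000
  std::printf("fp32 accumulator: %f\n", (double)acc_f32); // 4096.000000
  return 0;
}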


386-386: LGTM: Appropriate CUDA-only gating for L2 persistent cache.

The @tilelang.testing.requires_cuda decorator correctly gates this test since L2 persistent cache functionality relies on CUDA-specific APIs (__tvm_cuda_stream_set_access_policy_window_packed and __tvm_cuda_stream_reset_access_policy_window_packed). These features are not available on ROCm/HIP, making this gating necessary for cross-platform CI.

testing/python/kernel/test_tilelang_kernel_gemm_simt.py (3)

157-157: Correctness fix: Reference computation now uses out_dtype.

This change correctly aligns the reference result with the actual kernel output type. Previously, casting the reference to accum_dtype was incorrect whenever out_dtype differed from accum_dtype. The fix is consistent with similar test files (test_tilelang_kernel_gemm_mma.py line 193, test_tilelang_kernel_fp8_gemm_mma.py line 213).


164-164: LGTM: Improved accumulation precision.

Changing accum_dtype from T.float16 to T.float32 improves numerical accuracy during accumulation, which is especially important for matrix multiplication where many additions occur. This prevents potential precision loss while still producing a float16 output.


168-170: LGTM: CUDA-gated int8 test added.

The new test validates int8 matrix multiplication with int32 accumulation. The @tilelang.testing.requires_cuda decorator appropriately gates this test to CUDA environments, which aligns with the PR objective to enable ROCm CI while selectively skipping features not yet supported on ROCm.

testing/python/jit/test_tilelang_jit_gemm.py (2)

106-106: LGTM: Function name updated to reflect accumulation dtype.

The function rename from test_gemm_f16f16f16_nn_kernel_jit to test_gemm_f16f16f32_nn_kernel_jit correctly reflects the change in accumulation dtype to float32, improving code clarity.


115-115: LGTM: Float32 accumulation improves numerical precision.

Changing the accumulation dtype from T.float16 to T.float32 addresses potential numerical stability issues in GEMM operations, which aligns with the PR's bug-fix objective.

testing/python/jit/test_tilelang_jit_gemm_cython.py (1)

169-169: LGTM: Consistent float32 accumulation across Cython tests.

The systematic update of accumulation dtype from T.float16 to T.float32 across all test functions improves numerical stability and is consistent with the changes in test_tilelang_jit_gemm.py.

Also applies to: 211-211, 255-255, 304-308, 357-357

testing/python/debug/test_tilelang_debug_print.py (7)

2-2: LGTM: pytest import added for parametrization.

The pytest import is necessary for the @pytest.mark.parametrize decorator used in the refactored test.


20-24: Excellent refactoring: parametrized test improves coverage.

The parametrized test is well-structured and covers 12 dtypes (integer types, standard floating-point types, and bfloat16) in a single test function. This improves maintainability while ensuring comprehensive dtype coverage. Platform-specific FP8 dtypes are correctly tested separately with appropriate decorators.


15-15: Approved: target parameter removed to support multi-platform testing.

Removing the explicit target="cuda" parameter allows this test to run on both CUDA and ROCm platforms by using the default target selection (via TILELANG_TARGET environment variable or "auto"). This change aligns with the PR objective to enable ROCm CI testing.

Based on learnings, this addresses the previously noted target issues for this file.


50-50: Consistent with multi-platform enablement.

The removal of target="cuda" is consistent with the other test functions and enables this conditional debug print test to run on both CUDA and ROCm platforms.


69-69: LGTM: consistent target parameter removal.

Consistent with the multi-platform testing approach applied throughout this file.


88-88: LGTM: consistent target parameter removal.

Consistent with the multi-platform testing approach.


107-107: LGTM: final target parameter removal completes the refactoring.

This completes the consistent removal of explicit target="cuda" parameters throughout the file, enabling comprehensive multi-platform testing support.

testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py (1)

156-156: The decorator additions are justified, but the PR title may be misleading regarding scope.

These sparse GEMM tests use NVIDIA-specific features (hardcoded arch="8.0" and sparse tensor cores), making the @tilelang.testing.requires_cuda decorator appropriate. Unlike the non-sparse GEMM tests in test_tilelang_tilelibrary_gemm.py—which have separate ROCm-compatible variants (test_gemm_ss_fp8_rocm, etc.)—the sparse GEMM v2 tests have no ROCm counterparts and are genuinely CUDA-only.

However, the PR title "Open Rocm ci test and fix some bugs" may suggest broader ROCm CI enablement than is actually occurring with this change. Consider clarifying the PR title or description to reflect that these sparse tests remain architecture-specific to NVIDIA and are intentionally gated.

testing/python/language/test_tilelang_language_ptr.py (1)

44-54: No changes needed. The test correctly pairs execution_backend="cython" with CUDA tensors. The cython backend is explicitly designed to handle CUDA kernels and tensor pointers, as documented in the CythonKernelAdapter class and official documentation examples. This pattern is used throughout the test suite without issue.

Likely an incorrect or invalid review comment.

testing/python/language/test_tilelang_language_unroll.py (3)

20-20: Good explanatory comment.

The TODO clearly explains the HIP limitation for the unroll_factor parameter and why the test is CUDA-gated. This distinction from basic unroll with step is helpful for future HIP support work.


21-33: Appropriate CUDA gating and consistent target usage.

The @tilelang.testing.requires_cuda decorator correctly gates this test since HIP doesn't support the unroll_factor parameter (as noted in the TODO). The change to use the default target is consistent with the first test and safe here because the decorator ensures the test only runs on CUDA platforms where the default target will be CUDA.


6-17: No action needed. The code is correctly structured.

The first test uses T.unroll(0, 16, step=4) without CUDA gating by design. The TODO comment on line 20 specifically states that "unroll factor" (the unroll_factor parameter) is not supported on HIP, not the basic unroll with step parameter. The second test correctly uses unroll_factor and is properly gated with @tilelang.testing.requires_cuda. The distinction between these two unroll modes is intentional and correctly reflected in the test structure.

testing/python/language/test_tilelang_language_alias.py (1)

48-48: Add @tilelang.testing.requires_cuda decorator for consistency and robustness.

Every other test in testing/python/language/ includes a GPU backend requirement decorator (requires_cuda, requires_hip, or similar). This test should be gated with @tilelang.testing.requires_cuda to ensure it only runs when GPU support is available, preventing failures on systems without CUDA or ROCm.

Alternatively, if this test is intentionally designed to be backend-agnostic and work without GPU support, clarify that design choice with a comment.

⛔ Skipped due to learnings
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.
testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py (4)

114-118: LGTM! Improved accumulation precision.

Changing dtypeAccum from T.float16 to T.float32 improves numerical stability and reduces accumulation errors during GEMM operations. This is a solid correctness improvement.


271-275: LGTM! Improved accumulation precision.

Consistent with the SS variant changes, this improves numerical stability for the RS GEMM variant.


427-431: LGTM! Improved accumulation precision.

Consistent precision improvement for the SR GEMM variant.


590-593: LGTM! Improved accumulation precision.

Consistent precision improvement for the RR GEMM variant.

testing/python/language/test_tilelang_language_mask_op.py (1)

31-31: LGTM! Backend auto-detection enables ROCm compatibility.

Removing the hardcoded target="cuda" parameter allows the compilation target to be auto-detected based on the runtime environment. The determine_target() function checks for CUDA availability first, then ROCm/HIP, then Metal, enabling these tests to run on ROCm CI while maintaining backward compatibility with CUDA systems.

Verify that these tests pass on ROCm CI with the auto-detected target.

testing/python/language/test_tilelang_language_vectorized_cast.py (1)

78-78: LGTM! Proper CUDA gating for CUDA-specific test.

The decorator correctly gates this test to CUDA-only execution, which is appropriate since the test validates CUDA-specific vectorized cast intrinsics (e.g., __float22half2_rn, __nv_cvt_float2_to_fp8x2) and uses device="cuda". This aligns with the broader pattern across the PR to ensure tests only run in their target environments.

testing/python/language/test_tilelang_language_let.py (1)

6-6: LGTM! Appropriate CUDA gating.

The decorator correctly ensures this test only runs with CUDA available, which is necessary since the test compiles with target="cuda" and validates CUDA-specific code generation.

testing/python/language/test_tilelang_language_var_init.py (1)

6-7: LGTM! Clear documentation of platform limitation.

The TODO comment helpfully explains why this test is CUDA-only (var init not supported on HIP), and the decorator correctly gates the test accordingly. This provides good context for future maintenance.

testing/python/kernel/test_tilelang_kernel_gemm.py (7)

106-106: LGTM! Appropriate CUDA gating for f16f16f16 accumulation.

The decorator correctly restricts this f16 accumulation test to CUDA, consistent with the selective gating pattern applied across similar f16f16f16 variants in this file.


172-172: LGTM! Consistent CUDA gating.

Appropriately gates the transposed-A variant to CUDA, consistent with other f16f16f16 tests.


190-190: LGTM! Consistent CUDA gating.

Appropriately gates the transposed-B variant to CUDA, consistent with other f16f16f16 tests.


216-216: LGTM! Appropriate CUDA gating for f64 test.

The decorator correctly restricts the float64 GEMM test to CUDA.


237-238: LGTM! Well-documented temporary ROCm disablement.

The TODO clearly explains the precision issue on ROCm for this specific f32 transpose combination, and the CUDA gating appropriately restricts the test until the issue is resolved. This provides good documentation for tracking the outstanding work.


255-255: LGTM! Consistent padding test gating.

Appropriately gates the aligned padding test to CUDA, consistent with other f16f16f16 tests.


273-273: LGTM! Consistent padding test gating.

Appropriately gates the unaligned padding test to CUDA, consistent with other f16f16f16 tests.

testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py (2)

199-199: LGTM! Appropriate CUDA gating for INT4 tensor core test.

The decorator correctly gates this test to CUDA-only execution, which is necessary since the test uses INT4TensorCoreIntrinEmitter for CUDA-specific tensor core operations and validates generated CUDA source code.


403-403: LGTM! Appropriate CUDA gating for weight-only transform test.

The decorator correctly gates this test to CUDA-only execution, consistent with the INT4 tensor core operations that require CUDA.

testing/python/language/test_tilelang_language_alloc.py (4)

36-36: LGTM: Flexible assertion for codegen variations.

The relaxed assertions correctly accommodate different code generation patterns for T.alloc_var(), accepting both scalar (tmp =) and array-style (tmp[0] =) representations.

Also applies to: 76-76


154-156: LGTM: More precise assertions with kernel_only=True.

Using kernel_only=True isolates the kernel source from wrapper code, and the count-based assertions (code.count("= 1;") == 1) verify that each initializer appears exactly once, making the test more robust.


159-162: Verify if ROCm truly lacks alloc_var initializer support.

Same concern as lines 118-121: this compilation-only test is gated to CUDA despite the PR goal to "Open Rocm ci test." If the ROCm limitation is only at runtime, consider allowing this test to run on ROCm CI.

See verification script at lines 118-121.


118-121: Verify ROCm alloc_var initializer support with backend-specific codegen testing.

The observation that this test only compiles and inspects source code—not GPU execution—is accurate. However, multiple tests across the codebase share this exact TODO, suggesting a genuine codegen limitation rather than overly conservative gating. Before relaxing the CUDA requirement, confirm whether:

  1. ROCm's tilelang.compile() path actually produces incorrect or unsupported syntax for alloc_var initializers
  2. The underlying issue has since been fixed in ROCm backend codegen

If confirmed, update the TODO to reference the specific ROCm backend ticket or constraint. If the issue has already been resolved, remove the decorator and run the test on ROCm CI.

testing/python/cache/test_tilelang_kernel_cache.py (1)

121-123: LGTM: Correct CUDA gating for GPU execution.

This test correctly requires CUDA because it executes kernels on GPU (lines 187-191: .cuda() calls and kernel invocations with GPU tensors).

testing/python/language/test_tilelang_language_atomic_add.py (3)

210-211: LGTM!

The dtype parameter is properly propagated to atomic_addx2_program and the default value is consistent.


238-238: LGTM!

Adding @tilelang.testing.requires_cuda decorators is appropriate since atomic operations are CUDA-specific. This ensures tests skip gracefully on non-CUDA platforms.

Also applies to: 243-243


353-357: LGTM!

Adding the @tilelang.testing.requires_cuda decorator and testing multiple dtypes (float32, float16, bfloat16) improves both portability and coverage.

src/op/logical.cc (2)

45-46: LGTM! HIP lowering support added correctly.

The HIP lowering intrinsic registration for tl.any_of mirrors the CUDA pattern and correctly reuses the same any_of_op implementation.


53-54: LGTM! HIP lowering support added correctly.

The HIP lowering intrinsic registration for tl.all_of mirrors the CUDA pattern and correctly reuses the same all_of_op implementation.

src/tl_templates/hip/common.h (4)

3-5: LGTM! Required headers added for HIP atomics and warp functions.

The includes for atomic.h and amd_warp_functions.h properly support the new tl namespace utilities.


110-130: LGTM! Any and All helpers correctly implemented.

The Any and All functions provide simple loop-based reductions that match the expected semantics for logical operations on arrays.


150-198: Verify half_t and bfloat16_t shuffle specializations.

The shuffle specializations for half_t and bfloat16_t cast to/from float before shuffling. This approach should work correctly, but confirm that:

  1. The float conversion preserves the intended precision semantics
  2. The pattern matches the CUDA implementation in src/tl_templates/cuda/common.h

Based on the relevant code snippets, the CUDA implementation uses the same pattern with casts to/from float.


132-148: The current non-sync shuffle implementation appears to handle thread synchronization appropriately.

The code uses compile-time branching (if constexpr) to segregate intra-warp communication (via __shfl_* operations) from inter-warp communication (via __syncthreads() and shared memory). Within warp-level shuffles, unconditional execution ensures all participating lanes execute uniformly. The TODO comment about ROCm 7.1.1 providing shfl_sync remains valid as a future modernization opportunity, but the current approach is sound for the intended use cases.

src/tl_templates/hip/atomic.h (1)

1-104: LGTM! HIP atomic helpers correctly implemented with documented limitation.

The atomic operation helpers are well-structured and correctly delegate to HIP's native atomic functions. The comments clearly document that the memory_order parameter is unused and exists only for lowering compatibility. According to past review comments, the memory ordering limitation was addressed in commits 14067c3 to d592f8b.

The vectorized atomic operations (AtomicAddx2, AtomicAddx4, etc.) correctly handle per-element operations on float2 and float4 types.
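A sketch of the helper shape being described (signatures are illustrative rather than copied from atomic.h; TL_DEVICE and HIP's native atomicAdd are assumed):

// The memory_order argument exists only so lowering can pass it uniformly;
// it is currently ignored, as the header's comments note.
template <typename T>
TL_DEVICE void AtomicAdd(T *address, T value, int memory_order = 0) {
  (void)memory_order;
  atomicAdd(address, value); // delegate to HIP's native atomic
}

// Vectorized variant: two independent per-element atomics on a float2.
TL_DEVICE void AtomicAddx2(float2 *address, float2 value) {
  atomicAdd(&(address->x), value.x);
  atomicAdd(&(address->y), value.y);
}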

src/tl_templates/cuda/reduce.h (4)

185-186: LGTM! SEG parameterization improves flexibility.

Changing from a fixed warp size of 32 to the template parameter SEG allows the cumulative sum to work with different segment sizes, improving code reusability.


190-191: LGTM! Boundary check prevents out-of-bounds access.

The early return when gRow >= H correctly handles the case when the number of rows doesn't evenly divide by TILE_H.


195-221: LGTM! Axis-driven reverse cumulative sum correctly implemented.

The reverse path now computes real_row and real_col based on the Axis template parameter, allowing cumulative sums along either dimension. The logic correctly:

  1. Iterates segments in reverse order
  2. Applies per-segment reductions with shuffle operations
  3. Propagates carry values across segments
  4. Performs proper boundary checks before writing results

223-247: LGTM! Axis-driven forward cumulative sum correctly implemented.

The forward path mirrors the reverse implementation with Axis-based indexing, correctly:

  1. Iterates segments in forward order
  2. Applies per-segment reductions with shuffle operations
  3. Propagates carry values across segments
  4. Performs proper boundary checks before writing results
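A condensed sketch of that forward pattern for a single row (assumes SEG equals the wavefront width, full-lane execution, and the tl::shfl_* wrappers discussed above; the real CumSum2D::run adds Axis-based index mapping and further guards):

template <typename T, int SEG>
TL_DEVICE void cumsum_row_forward(const T *src, T *dst, int W) {
  const int lane = threadIdx.x % SEG; // SEG assumed == wavefront width
  T carry = T(0);
  for (int base = 0; base < W; base += SEG) {          // 1. segments in forward order
    T v = (base + lane < W) ? src[base + lane] : T(0);
    for (int offset = 1; offset < SEG; offset <<= 1) { // 2. shuffle-based inclusive scan
      T up = tl::shfl_up(v, offset);
      if (lane >= offset)
        v += up;
    }
    v += carry;                                        // 3. carry from previous segments
    if (base + lane < W)                               // 4. boundary check before writing
      dst[base + lane] = v;
    carry = tl::shfl(v, SEG - 1);                      // broadcast the segment total
  }
}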
src/target/codegen_hip.cc (1)

985-994: All tl::warp_reduce_* functions are properly implemented in src/tl_templates/hip/reduce.h (lines 273–291) with correct HIP device syntax and expected warp-level reduction semantics.

src/tl_templates/hip/reduce.h (4)

76-76: LGTM! Correct use of type-safe shuffle wrappers.

The changes to use tl::shfl_down (line 76) and tl::shfl_xor (line 107) ensure proper handling of half_t and bfloat16_t types by leveraging the specializations in common.h.

Also applies to: 107-107


117-182: LGTM! CumSum1D implementation is correct.

The 1D cumulative sum implementation correctly:

  • Uses type-safe tl::shfl_* wrappers throughout
  • Handles both forward and reverse modes with proper carry propagation across segments
  • Includes appropriate boundary checks for input size

184-260: LGTM! CumSum2D implementation is correct.

The 2D cumulative sum implementation correctly:

  • Uses type-safe tl::shfl_* wrappers throughout
  • Properly handles axis selection with appropriate row/column index mapping
  • Implements both forward and reverse modes with correct carry propagation
  • Includes proper boundary checks for both dimensions

273-291: LGTM! Wrapper functions are correctly implemented.

The warp_reduce_* wrapper functions are properly structured and use the appropriate reduction operators. However, they depend on the warp_reduce function (lines 262-271), which needs to be fixed to use tl::shfl_xor instead of raw __shfl_xor intrinsics.
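For orientation, the composition looks roughly like this (functor names are illustrative; the actual wrappers are at reduce.h lines 273-291):

struct SumOpSketch {
  template <typename T> TL_DEVICE T operator()(T a, T b) const { return a + b; }
};
struct MaxOpSketch {
  template <typename T> TL_DEVICE T operator()(T a, T b) const { return a > b ? a : b; }
};

template <typename T> TL_DEVICE T warp_reduce_sum(T value) {
  return warp_reduce(value, SumOpSketch{}); // butterfly reduction via shfl_xor
}
template <typename T> TL_DEVICE T warp_reduce_max(T value) {
  return warp_reduce(value, MaxOpSketch{});
}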

src/tl_templates/hip/debug.h (5)

6-21: LGTM! Well-designed fallback template.

The primary template provides a sensible fallback for unknown types by printing their address. This ensures the system handles any type gracefully.


23-38: LGTM! Clean macro design.

The macro effectively reduces code duplication by generating specialized PrintTraits implementations with consistent structure.


60-90: LGTM! Well-implemented explicit specializations.

The bool specialization correctly prints "true"/"false" strings, and the pointer specialization appropriately handles pointer types by printing their address.


92-101: LGTM! Clean delegation to trait system.

The public API functions properly delegate to the trait-based implementation while maintaining backward-compatible signatures.


53-57: Review focus correction: Verify fp8 type conversions specifically.

The macro uses C-style casts to float. This works for half_t (which is _Float16) and bfloat16_t (which is hip_bfloat16)—both are standard HIP types with implicit float conversion support.

However, fp8_e4_t and fp8_e5_t are typealiases to raw HIP FP8 types (__hip_fp8_e4m3_fnuz and __hip_fp8_e5m2_fnuz). These do not have operator float() definitions in the codebase. Verify whether the HIP runtime provides implicit float conversion for these raw FP8 types; if not, explicit conversion functions may be needed.

tilelang/jit/adapter/wrapper.py (4)

314-316: Good refactoring for maintainability.

Extracting the declaration parsing logic into a dedicated method improves code organization and enables subclass-specific overrides (as demonstrated by the HIP implementation).


592-592: LGTM!

The uint64 to uint64_t type mapping is correct and consistent with other integer type mappings in the class.


585-585: Mapping is correct. The type fp8_e5_t is properly defined in HIP templates (src/tl_templates/hip/hip_fp8.h line 10) as an alias for __hip_fp8_e5m2_fnuz, confirming the mapping of "float8_e5m2fnuz" to "fp8_e5_t" is sound.


612-615: The HIP-specific parsing correctly uses split("{")[0] to extract the kernel signature. This is the appropriate approach for HIP kernels, which transition directly to the function body rather than using semicolon-terminated declarations like CUDA. No robustness concerns exist—the implementation is correct as-is.

Comment on lines +945 to +946
{"float8_e5m2fnuzx4", "fp8_e5_4_t"},
{"float8_e5m2fnuzx8", "long"},

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the file context around lines 945-946
sed -n '935,955p' src/target/codegen_hip.cc

Repository: tile-ai/tilelang

Length of output: 890


🏁 Script executed:

# Also search for other FP8 mappings to understand the pattern
rg -A5 -B5 "float8_e5m2fnuzx" src/target/codegen_hip.cc

Repository: tile-ai/tilelang

Length of output: 638


🏁 Script executed:

# Search for type definitions related to fp8_e5_4_t
rg "fp8_e5_4_t|fp8.*e5" --type cc --type h

Repository: tile-ai/tilelang

Length of output: 86


🌐 Web query:

ROCm MFMA FP8 e5m2 data type support and mappings hip header

💡 Result:

Short summary of ROCm FP8 (E5M2) / MFMA HIP mappings and header names

  • HIP C typedefs / builtins:

    • __hip_fp8_e5m2 — standard OCP E5M2 FP8 type. [1]
    • __hip_fp8_e5m2_fnuz — FNUZ variant (no inf, signed-zero handling per AMD). [1][2]
  • hipDataType enum entries (library_types.h):

    • HIP_R_8F_E5M2 = 29 (maps to __hip_fp8_e5m2). [1]
    • HIP_R_8F_E5M2_FNUZ = 1001 (maps to __hip_fp8_e5m2_fnuz). [1]
  • Notes about MFMA / hardware support:

    • FP8 (E5M2) is supported as an internal compute type on CDNA3 / MI300-class Matrix cores (MFMA/Matrix units); library support (rocWMMA/rocBLAS/hipBLAS/etc.) depends on hardware and ROCm release. [2][1]

References
[1] ROCm “Data types and precision support” / hip type & hipDataType listings.
[2] ROCm “Precision support” (notes on E5M2, FNUZ and MI300/Matrix cores).


🏁 Script executed:

# Search for fp8_e5_4_t and fp8_e4_4_t type definitions
rg "fp8_e[45]_4_t" src/

Repository: tile-ai/tilelang

Length of output: 2784


🏁 Script executed:

# Check if these types are defined in HIP headers or locally
fd -e h | xargs grep -l "fp8_e[45]_4_t" 2>/dev/null || echo "Not found in headers"

Repository: tile-ai/tilelang

Length of output: 123


🏁 Script executed:

# Search more broadly for FP8 type definitions
rg "typedef.*fp8|struct.*fp8" --type-list | head -20

Repository: tile-ai/tilelang

Length of output: 449


🏁 Script executed:

# Look for where these mappings are used/applied
rg "float8_e5m2fnuzx4|float8_e4m3fnuzx4" src/

Repository: tile-ai/tilelang

Length of output: 201


🏁 Script executed:

# Read the hip_fp8.h file to see all FP8 type definitions
cat -n src/tl_templates/hip/hip_fp8.h | head -100

Repository: tile-ai/tilelang

Length of output: 3006


🏁 Script executed:

# Check if there's an 8-element FP8 type
rg "fp8_e[45]_8_t|e[45]m[23].*8" src/tl_templates/hip/hip_fp8.h

Repository: tile-ai/tilelang

Length of output: 321


🏁 Script executed:

# Look at the full structure of fp8_e5_8_t and understand the x8 mapping
sed -n '1,200p' src/tl_templates/hip/hip_fp8.h

Repository: tile-ai/tilelang

Length of output: 5884


🏁 Script executed:

# Look at broader context around the dtype map definition
sed -n '920,960p' src/target/codegen_hip.cc

Repository: tile-ai/tilelang

Length of output: 1932


🏁 Script executed:

# Search for any comments or documentation about x8 → long mapping
rg -B10 "float8.*fnuzx8.*long" src/target/codegen_hip.cc

Repository: tile-ai/tilelang

Length of output: 525


🏁 Script executed:

# Check if there are MFMA builtin requirements documented
rg "mfma.*float8|MFMA.*FP8" src/target/codegen_hip.cc | head -20

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

# Search for any test or usage of these x8 mappings
rg "float8_e[45]m[23]fnuzx8" src/ --type cc --type h

Repository: tile-ai/tilelang

Length of output: 86


Fix FP8 x8 dtype mappings to use proper struct types.

The x8 FP8 dtype mappings use long instead of the semantically correct struct types defined in hip_fp8.h. The correct mappings should be:

  • float8_e4m3fnuzx8 → fp8_e4_8_t (currently maps to long)
  • float8_e5m2fnuzx8 → fp8_e5_8_t (currently maps to long)

Both fp8_e4_8_t and fp8_e5_8_t are fully defined in src/tl_templates/hip/hip_fp8.h with proper member access and alignment. Using long is inconsistent with how x4 variants are correctly mapped to their struct types (fp8_e4_4_t and fp8_e5_4_t) and loses type safety for MFMA operations.

🤖 Prompt for AI Agents
In src/target/codegen_hip.cc around lines 945-946 the FP8 x8 dtype mappings
incorrectly use "long"; replace the two mappings so "float8_e4m3fnuzx8" maps to
"fp8_e4_8_t" and "float8_e5m2fnuzx8" maps to "fp8_e5_8_t" and ensure the file
includes the hip_fp8.h header (or the header that defines fp8_e4_8_t and
fp8_e5_8_t) so the struct types are available for MFMA/type-safe operations.

Comment on lines +178 to +179
static TL_DEVICE void run(const T *__restrict__ src, T *__restrict__ dst,
int H, int W) {

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the actual file to confirm the current signature
head -n 200 src/tl_templates/cuda/reduce.h | tail -n 30

Repository: tile-ai/tilelang

Length of output: 1065


🏁 Script executed:

# Search for all calls to CumSum2D::run
rg -n 'CumSum2D.*::run\s*\(' --type=cpp --type=h -C3

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

# Also search more broadly for CumSum2D usage
rg -n 'CumSum2D' --type=cpp --type=h

Repository: tile-ai/tilelang

Length of output: 365


🏁 Script executed:

# Look for any assignment from CumSum2D::run calls
rg -n 'CumSum2D.*run\s*\(' . --type=cpp --type=h -A2 -B2

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

# Check the reduce.cc file around line 551 to see how CumSum2D is used
sed -n '540,560p' src/op/reduce.cc

Repository: tile-ai/tilelang

Length of output: 919


🏁 Script executed:

# Search for actual invocations/calls of CumSum2D::run
rg 'CumSum2D' -A5 -B5 --type=cc --type=h | head -80

Repository: tile-ai/tilelang

Length of output: 86


🏁 Script executed:

# Look for any kernel launches or calls that might use CumSum2D
rg -n '\.run\s*\(' src/op/reduce.cc -A3 -B3

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

# Check the full context of CumSum2D in reduce.cc
sed -n '545,570p' src/op/reduce.cc

Repository: tile-ai/tilelang

Length of output: 1101


🏁 Script executed:

# Check the exact lines again with more context
sed -n '545,570p' src/op/reduce.cc

Repository: tile-ai/tilelang

Length of output: 1101


🏁 Script executed:

# Look for any mismatch between Call return type and function signature
rg 'Evaluate\(Call\(' src/op/reduce.cc -B10 | head -40

Repository: tile-ai/tilelang

Length of output: 567


🏁 Script executed:

# Check if there are TODO/FIXME comments related to CumSum
rg -i 'TODO|FIXME' src/op/reduce.cc | grep -i cumsum

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

# Verify the current signature of CumSum1D and CumSum2D to understand the pattern
rg 'CumSum1D.*run' -A2 src/tl_templates/cuda/reduce.h

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

# Check the full CumSum implementation pattern to understand intent
cat src/tl_templates/cuda/reduce.h | head -250 | tail -80

Repository: tile-ai/tilelang

Length of output: 2419


🏁 Script executed:

# Look for other uses of call_extern with different return types
rg 'call_extern' src/op/reduce.cc -B3 -A3

Repository: tile-ai/tilelang

Length of output: 1198


🏁 Script executed:

# Check if there are any type conversion or void handling in the call
rg -A20 'CumSum2D.*::run' src/tl_templates/cuda/reduce.h | head -30

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

# Check CumSum1D to see if it also returns void
rg -B2 -A5 'static TL_DEVICE.*run.*CumSum1D' src/tl_templates/cuda/reduce.h

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

# Look at the full context around the Call to understand what dst->dtype represents
sed -n '520,575p' src/op/reduce.cc

Repository: tile-ai/tilelang

Length of output: 2195


🏁 Script executed:

# Find CumSum1D::run signature
rg -A3 'struct CumSum1D' src/tl_templates/cuda/reduce.h | head -30

Repository: tile-ai/tilelang

Length of output: 298


🏁 Script executed:

# Check if there are any recent changes or git history
git log --oneline src/op/reduce.cc | head -10

Repository: tile-ai/tilelang

Length of output: 90


🏁 Script executed:

# Check for any compilation errors or warnings
rg 'void' src/tl_templates/cuda/reduce.h | grep -i cumsum

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

# Get the full CumSum1D::run signature
rg -A8 'struct CumSum1D' src/tl_templates/cuda/reduce.h | grep -A5 'static TL_DEVICE'

Repository: tile-ai/tilelang

Length of output: 203


🏁 Script executed:

# Check git history to see if this was a recent change
git log -p --follow -- src/tl_templates/cuda/reduce.h | grep -B5 -A5 'CumSum.*void' | head -30

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

# Look at the last commit that touched this file
git log --oneline -n 5 -- src/tl_templates/cuda/reduce.h

Repository: tile-ai/tilelang

Length of output: 90


🏁 Script executed:

# Check if there's special handling for void in call_extern
rg 'call_extern' src/ -B2 -A2 | grep -B5 -A5 'void'

Repository: tile-ai/tilelang

Length of output: 1550


🏁 Script executed:

# Look for any Cast operations around the CumSum call
sed -n '535,575p' src/op/reduce.cc

Repository: tile-ai/tilelang

Length of output: 1532


🏁 Script executed:

# Check the type of builtin::call_extern to understand expected return type
rg 'builtin::call_extern' src/ -B3 | head -30

Repository: tile-ai/tilelang

Length of output: 2119


🏁 Script executed:

# Verify if there's a mismatch between function signature and codegen
rg -A1 'Call.*dst.*dtype.*call_extern' src/op/reduce.cc

Repository: tile-ai/tilelang

Length of output: 139


🏁 Script executed:

# Look for other examples of Call with void return type
rg 'Call\(.*void' src/ --type=cpp | head -20

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

# Check if DataType::Void() exists
rg 'DataType::Void\|void_dtype' src/ --type=cpp

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

# Examine how other void-returning functions are called via call_extern
rg -B5 'DataType::Handle' src/op/ | grep -A3 'call_extern'

Repository: tile-ai/tilelang

Length of output: 457


🏁 Script executed:

# Check if the mismatch between Call(dst->dtype) and void return has any handling
git log -p --all -- src/op/reduce.cc | grep -B10 -A10 'CumSum2D.*void' | head -40

Repository: tile-ai/tilelang

Length of output: 42


Fix code generator to match void return type of CumSum functions.

Both CumSum1D::run and CumSum2D::run now return void, writing results directly to dst. However, the code generator in src/op/reduce.cc line 570 incorrectly declares the return type as dst->dtype. Update the call to use DataType::Handle() instead, matching the pattern used elsewhere for void-returning call_extern invocations (see src/op/gemm.cc and src/op/copy.cc).

Current code generation (reduce.cc:570)
return Evaluate(Call(dst->dtype, builtin::call_extern(), args));
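For comparison, a one-line sketch of the corrected emission, following the handle-dtype pattern the fix references in src/op/gemm.cc and src/op/copy.cc:

return Evaluate(Call(DataType::Handle(), builtin::call_extern(), args));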
🤖 Prompt for AI Agents
In src/op/reduce.cc around line 570, the code generator incorrectly emits a
return type of dst->dtype for CumSum call_extern invocations even though
CumSum1D::run and CumSum2D::run are void and write directly into dst; change the
emitted return type to DataType::Handle() (the void/handle pattern used in
src/op/gemm.cc and src/op/copy.cc) so the generated call uses DataType::Handle()
for call_extern and the Evaluate(...) call matches the void-returning extern
function.

Comment on lines +8 to +9
# There is a known issue where the cython execution backend fails to build with T.symbolic.
# Forcing the TVM FFI execution backend to avoid the issue on HIP.

⚠️ Potential issue | 🟡 Minor

Add issue reference for better tracking.

The comment explains the workaround but doesn't include an issue reference. Based on past review feedback, you mentioned opening an issue for this cython backend bug. Please add the issue number here so maintainers know when it's safe to remove this workaround.

🔎 Suggested improvement
-# There is a known issue where the cython execution backend fails to build with T.symbolic.
-# Forcing the TVM FFI execution backend to avoid the issue on HIP.
+# There is a known issue where the cython execution backend fails to build with T.symbolic (see issue #XXXX).
+# Forcing the TVM FFI execution backend to avoid the issue on HIP.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# There is a known issue where the cython execution backend fails to build with T.symbolic.
# Forcing the TVM FFI execution backend to avoid the issue on HIP.
# There is a known issue where the cython execution backend fails to build with T.symbolic (see issue #XXXX).
# Forcing the TVM FFI execution backend to avoid the issue on HIP.
🤖 Prompt for AI Agents
In testing/python/language/test_tilelang_language_annot.py around lines 8-9, the
workaround comment for forcing the TVM FFI execution backend lacks an issue
reference; update the comment to include the tracking issue number or URL (e.g.,
"See issue #<NNNN>" or full GitHub issue link). If the issue hasn't been opened
yet, create one describing the cython backend/T.symbolic build failure on HIP
and then insert that new issue number/link in the comment so maintainers know
when to remove the workaround.

Comment on lines +448 to +470
@tilelang.testing.requires_cuda
@pytest.mark.parametrize(
"M, N, K, trans_A, trans_B, in_dtype, out_dtype, dtypeAccum, block_M, block_N, block_K, num_stages, num_threads",
[
(128, 128, 128, True, True, T.float8_e5m2, T.float8_e5m2, T.float32, 128, 128, 32, 2, 128),
],
)
def test_gemm_sr_fp8_cuda(M, N, K, trans_A, trans_B, in_dtype, out_dtype, dtypeAccum, block_M, block_N, block_K, num_stages, num_threads):
run_gemm_sr(M, N, K, trans_A, trans_B, in_dtype, out_dtype, dtypeAccum, block_M, block_N, block_K, num_stages, num_threads)


@tilelang.testing.requires_rocm
@pytest.mark.parametrize(
"M, N, K, trans_A, trans_B, in_dtype, out_dtype, dtypeAccum, block_M, block_N, block_K, num_stages, num_threads",
[
# TODO: There is precision problem needs to repair
# (128, 128, 128, True, True, T.float8_e5m2fnuz, T.float8_e5m2fnuz, T.float32, 128, 128, 32, 2, 128),
(128, 128, 128, True, True, T.float8_e4m3fnuz, T.float8_e4m3fnuz, T.float32, 128, 128, 32, 2, 128),
],
)
def test_gemm_sr_fp8_rocm(M, N, K, trans_A, trans_B, in_dtype, out_dtype, dtypeAccum, block_M, block_N, block_K, num_stages, num_threads):
run_gemm_sr(M, N, K, trans_A, trans_B, in_dtype, out_dtype, dtypeAccum, block_M, block_N, block_K, num_stages, num_threads)


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Find the test file and check its line count
wc -l testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py

Repository: tile-ai/tilelang

Length of output: 124


🏁 Script executed:

# Search for all FP8 test definitions to see the complete picture
rg -n "def test_gemm.*fp8" testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py -A 2

Repository: tile-ai/tilelang

Length of output: 2320


🏁 Script executed:

# Search for skip markers in relation to FP8 tests
rg -n "skip|pytest.mark.skip" testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py -B 5 | head -100

Repository: tile-ai/tilelang

Length of output: 1956


🏁 Script executed:

# Check the context around SS and RS FP8 tests to see if they have skip markers
rg -n "test_gemm_ss_fp8|test_gemm_rs_fp8|test_gemm_rr_fp8" testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py -B 8 -A 5

Repository: tile-ai/tilelang

Length of output: 4594


SR and RR FP8 tests are enabled while SS and RS remain skipped—document the rationale.

The test gating is intentionally differentiated by GEMM variant: the SS and RS variants are skipped with explicit reasons (temporarily disabled until the GEMM SS/RS issues are resolved), while the SR and RR variants run without skip markers. This pattern suggests the variants have different maturity levels, but the rationale for that decision is not documented in the code.

Additionally, both SR and RR ROCm tests contain identical untracked TODO comments about float8_e5m2fnuz precision problems (lines 463-464, 627-629). Create tracking issues for these precision problems and add inline documentation explaining why SR/RR are enabled while SS/RS are disabled.

🤖 Prompt for AI Agents
In testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py around lines 448
to 470, document why SR and RR GEMM FP8 tests are enabled while SS and RS are
skipped, and replace the untracked TODOs about float8_e5m2 precision with links
to tracking issues: create one or two issues in the repo for the float8_e5m2fnuz
precision problem (and any related float8 variants) and update the two ROCm TODO
comments (and the earlier SS/RS skip comments) to reference those issue
numbers/URLs and a short rationale sentence stating why SR/RR are considered
stable enough to run while SS/RS remain skipped pending fixes; ensure both ROCm
occurrences reference the same tracking issue so future readers can find the bug
report.

from tvm.ir import CallingConv
from tvm.target import Target
from tilelang.contrib import hipcc, nvcc
from tilelang.env import COMPOSABLE_KERNEL_INCLUDE_DIR, CUTLASS_INCLUDE_DIR, TILELANG_TEMPLATE_PATH

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Verify that required path constants are defined in tilelang/env.py

echo "=== Checking for constant definitions in tilelang/env.py ==="
rg -n "^(TILELANG_TEMPLATE_PATH|CUTLASS_INCLUDE_DIR|COMPOSABLE_KERNEL_INCLUDE_DIR)\s*=" tilelang/env.py

echo -e "\n=== Checking if constants are exported ==="
rg -n "^__all__" tilelang/env.py -A 20

echo -e "\n=== Full contents of tilelang/env.py to understand context ==="
head -100 tilelang/env.py

Repository: tile-ai/tilelang

Length of output: 4415


🏁 Script executed:

#!/bin/bash
# Check imports at the beginning of tilelang/env.py
echo "=== Imports in tilelang/env.py ==="
head -50 tilelang/env.py | grep -E "^import|^from"

# Check context around lines 372-374
echo -e "\n=== Context around constant definitions (lines 350-400) ==="
sed -n '350,400p' tilelang/env.py

# Check for error handling related to these constants
echo -e "\n=== Searching for error handling or validation ==="
rg -n "CUTLASS_INCLUDE_DIR|COMPOSABLE_KERNEL_INCLUDE_DIR|TILELANG_TEMPLATE_PATH" tilelang/env.py | head -30

Repository: tile-ai/tilelang

Length of output: 2144


🏁 Script executed:

#!/bin/bash
# Find the env object definition
echo "=== Searching for 'env =' in tilelang/env.py ==="
rg -n "^env\s*=" tilelang/env.py

# Check for EnvVar class definition
echo -e "\n=== Searching for EnvVar class ==="
rg -n "class EnvVar" tilelang/env.py -A 10

# Check full initialization around those lines
echo -e "\n=== Context around EnvVar initialization (lines 200-240) ==="
sed -n '200,240p' tilelang/env.py

Repository: tile-ai/tilelang

Length of output: 2435


🏁 Script executed:

#!/bin/bash
# Get full EnvVar class definition
echo "=== EnvVar class definition ==="
sed -n '125,180p' tilelang/env.py

# Check if there's a get() method
echo -e "\n=== EnvVar methods ==="
rg -n "def " tilelang/env.py | grep -A 5 -B 5 "EnvVar"

Repository: tile-ai/tilelang

Length of output: 2804


🏁 Script executed:

#!/bin/bash
# Get full EnvVar class definition
echo "=== EnvVar class definition (lines 125-200) ==="
sed -n '125,200p' tilelang/env.py

Repository: tile-ai/tilelang

Length of output: 3431


🏁 Script executed:

#!/bin/bash
# Check how the constants are used in lower.py
echo "=== Usage of imported constants in lower.py ==="
rg -n "TILELANG_TEMPLATE_PATH|CUTLASS_INCLUDE_DIR|COMPOSABLE_KERNEL_INCLUDE_DIR" tilelang/engine/lower.py -A 3 -B 1

# Get full lower.py to understand context
echo -e "\n=== Full lower.py file ==="
wc -l tilelang/engine/lower.py

Repository: tile-ai/tilelang

Length of output: 931


🏁 Script executed:

#!/bin/bash
# Check if constants can be None
echo "=== Checking EnvVar dataclass decorator ==="
sed -n '120,130p' tilelang/env.py

# Check the initialization logic more carefully
echo -e "\n=== Full initialization section (lines 340-375) ==="
sed -n '340,375p' tilelang/env.py

# Check if there's any None-handling in lower.py
echo -e "\n=== Check for None handling in lower.py around those lines ==="
sed -n '60,130p' tilelang/engine/lower.py

Repository: tile-ai/tilelang

Length of output: 4147


Add validation to ensure required paths are available before use.

The constants are properly defined in tilelang.env and initialized with path discovery logic. However, when paths are not found, the constants default to None, which will cause a TypeError when lower.py attempts string concatenation with -I at lines 73–74 and 120–121. Add a runtime check to either:

  • Validate that these constants are non-None before using them
  • Raise an informative error at import time if required paths are unavailable
  • Provide fallback defaults if installation layouts vary
🤖 Prompt for AI Agents
In tilelang/engine/lower.py around line 13 (and referencing uses at lines 73–74
and 120–121), the imported path constants from tilelang.env may be None and are
later concatenated with "-I", causing a TypeError; add a runtime validation at
module import (or before the first use) that COMPOSABLE_KERNEL_INCLUDE_DIR,
CUTLASS_INCLUDE_DIR and TILELANG_TEMPLATE_PATH are not None, and if any are
missing raise a clear ImportError/RuntimeError explaining which path is missing
and how to provide it (e.g., install prerequisites or set environment
variables), or alternatively supply documented fallback defaults if appropriate
for your project layout. Ensure the check reports the specific constant name and
its expected purpose so the error is actionable.
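
For illustration, a minimal sketch of the first option, assuming the three constants are importable from tilelang.env at module scope under the names shown above (the exact import style in lower.py may differ):

from tilelang.env import (
    COMPOSABLE_KERNEL_INCLUDE_DIR,
    CUTLASS_INCLUDE_DIR,
    TILELANG_TEMPLATE_PATH,
)

# Fail fast with an actionable message instead of a TypeError at '"-I" + None'.
_REQUIRED_INCLUDE_PATHS = {
    "TILELANG_TEMPLATE_PATH": TILELANG_TEMPLATE_PATH,
    "CUTLASS_INCLUDE_DIR": CUTLASS_INCLUDE_DIR,
    "COMPOSABLE_KERNEL_INCLUDE_DIR": COMPOSABLE_KERNEL_INCLUDE_DIR,
}

for _name, _path in _REQUIRED_INCLUDE_PATHS.items():
    if _path is None:
        raise RuntimeError(
            f"tilelang.env.{_name} is not set; it is required to build the -I "
            "include flags for kernel compilation. Install the missing dependency "
            "or point the corresponding environment variable at its directory."
        )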

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 8

♻️ Duplicate comments (15)
.github/workflows/ci.yml (1)

376-376: Grammar issue already flagged in previous review.

The grammatical error in this comment ("needs to repair" → "need to be repaired") was already identified in a previous review.

testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py (6)

133-156: Add tracking issue references to skip decorators.

The FP8 tests for the SS GEMM variant are skipped with a generic reason. Create tracking issues for the "GEMM SS issues" referenced in the skip decorator, then update the decorator to include the issue number for traceability (e.g., reason="Temporarily disabling until GEMM SS issues are resolved. See issue #XXXX").

Based on past review comments, this issue was previously flagged but not addressed.


290-313: Add tracking issue references to skip decorators.

The FP8 tests for the RS GEMM variant are skipped with a generic reason. Create tracking issues for the "GEMM RS issues" and update the skip decorator to include the issue number for traceability.

Based on past review comments, this issue was previously flagged but not addressed.


432-441: Replace untracked TODO comments with tracking issue references.

The TODO comments about precision problems on ROCm (lines 432-434 for int8, lines 439-441 for float) lack tracking issue references. Create tracking issues for these precision problems and replace the TODO comments with issue references for traceability.

Based on past review comments, untracked TODOs were previously flagged but not addressed.


448-470: Document the rationale for enabling SR FP8 tests while SS/RS remain disabled.

The SR GEMM variant FP8 tests are enabled (no skip decorator) while SS and RS variants remain skipped. This suggests different maturity levels, but the decision rationale is not documented. Additionally, the TODO comment about float8_e5m2fnuz precision problems (lines 463-464) lacks a tracking issue reference.

Create tracking issues for the precision problems and add inline documentation explaining why SR/RR variants are considered stable enough to run while SS/RS remain skipped pending fixes.

Based on past review comments, this issue was previously flagged but not addressed.


595-597: Replace untracked TODO comments with tracking issue references.

The TODO comments about precision problems on ROCm (lines 595-597 and 627-629) both reference float8_e5m2fnuz precision issues but lack tracking issue references. Create a tracking issue for this precision problem and replace the TODO comments with issue references for traceability.

Based on past review comments, untracked TODOs were previously flagged but not addressed.

Also applies to: 627-629


612-634: Document the rationale for enabling RR FP8 tests while SS/RS remain disabled.

The RR GEMM variant FP8 tests are enabled (no skip decorator) while SS and RS variants remain skipped. Add inline documentation explaining why RR (like SR) is considered stable enough to run while SS/RS remain skipped pending fixes.

Based on past review comments, this issue was previously flagged but not addressed.

testing/python/cache/test_tilelang_kernel_cache.py (2)

199-199: Remove the CUDA decorator from this compilation-only test.

This test only calls tilelang.compile() without any GPU execution. The @tilelang.testing.requires_cuda decorator is inappropriate and will skip this test on ROCm CI, contradicting the PR objective to "Open Rocm ci test."

Based on past review comments, this issue was previously flagged but not addressed.


250-250: Remove the CUDA decorator from this compilation-only test.

This test only calls tilelang.compile() without any GPU execution. The @tilelang.testing.requires_cuda decorator is inappropriate and will skip this test on ROCm CI, contradicting the PR objective to "Open Rocm ci test."

Based on past review comments, this issue was previously flagged but not addressed.

src/tl_templates/hip/debug.h (1)

40-49: Missing unsigned long long specialization.

The past review comment about the missing unsigned long long specialization has not been addressed. Line 49 defines long long, but without an unsigned long long specialization, uint64_t values may fall back to the pointer-printing trait instead of printing their actual numeric values.

🔎 Proposed fix
 DEFINE_PRINT_TRAIT(long long, "long long", "%lld", long long);
+DEFINE_PRINT_TRAIT(unsigned long long, "unsigned long long", "%llu", unsigned long long);
src/tl_templates/hip/reduce.h (1)

262-271: Use tl::shfl_xor for type safety with half/bfloat16 types.

This function uses raw __shfl_xor intrinsics which will produce incorrect results for half_t and bfloat16_t types. These types require the tl::shfl_xor wrapper that converts through float. This is inconsistent with the rest of this file where tl::shfl_* wrappers are correctly used.

🔎 Proposed fix
 template <typename T, typename ReduceOp>
 TL_DEVICE T warp_reduce(T value, ReduceOp op) {
-  value = op(value, __shfl_xor(value, 32));
-  value = op(value, __shfl_xor(value, 16));
-  value = op(value, __shfl_xor(value, 8));
-  value = op(value, __shfl_xor(value, 4));
-  value = op(value, __shfl_xor(value, 2));
-  value = op(value, __shfl_xor(value, 1));
+  value = op(value, tl::shfl_xor(value, 32));
+  value = op(value, tl::shfl_xor(value, 16));
+  value = op(value, tl::shfl_xor(value, 8));
+  value = op(value, tl::shfl_xor(value, 4));
+  value = op(value, tl::shfl_xor(value, 2));
+  value = op(value, tl::shfl_xor(value, 1));
   return value;
 }
testing/python/language/test_tilelang_language_atomic_add.py (1)

253-254: Add @requires_cuda decorator and fix dtype for atomic_addx2.

This test has two critical issues:

  1. Missing decorator: test_atomic_addx2_half (line 249) has @tilelang.testing.requires_cuda, but this test doesn't. Both use the same CUDA-specific operation and should both require it.

  2. Invalid dtype: atomic_addx2 is designed for paired half-precision types (float16/bfloat16) as shown in the function docstring examples. Using float32 will likely fail or produce incorrect results. Use atomic_addx4 for float32 (which already has a dedicated test at line 360), or change this test to use float16 or bfloat16.

Immediate fix for missing decorator
+@tilelang.testing.requires_cuda
 def test_atomic_addx2_float():
     run_atomic_addx2(32, 64, 8, 16, dtype=T.float32)

Then either:

  • Change dtype=T.float32 to dtype=T.float16 (or T.bfloat16)
  • Or remove this test if test_atomic_addx4 at line 360 already covers float32 atomic operations
testing/python/language/test_tilelang_language_mask_op.py (3)

65-66: Same verification needed as lines 31-32.

This follows the same pattern as the first test - removed target="cuda" from compile while keeping hardcoded device="cuda" in torch tensor creation.


100-101: Same verification needed as lines 31-32.

This follows the same pattern as the other tests - removed target="cuda" from compile while keeping hardcoded device="cuda" in torch tensor creation.


134-135: Same verification needed as lines 31-32.

This follows the same pattern as the other tests - removed target="cuda" from compile while keeping hardcoded device="cuda" in torch tensor creation.

🧹 Nitpick comments (8)
src/tl_templates/hip/atomic.h (1)

7-11: Redundant reinterpret_cast in pointer overloads.

The reinterpret_cast<T1*>(address) on line 10 is a no-op since address is already of type T1*. The same applies to other pointer overloads (lines 35, 52). This is harmless but adds unnecessary noise.

🔎 Proposed simplification
 template <typename T1, typename T2>
 __forceinline__ __device__ void AtomicAdd(T1 *address, T2 val,
                                           int memory_order = 0) {
-  atomicAdd(reinterpret_cast<T1 *>(address), static_cast<T1>(val));
+  atomicAdd(address, static_cast<T1>(val));
 }

Similar changes for AtomicMax (line 35) and AtomicMin (line 52).

src/tl_templates/hip/common.h (1)

112-130: Sequential loops may be slow for large arrays.

The Any and All functions use sequential iteration on the device. This is fine for small fixed-size arrays but could be a performance bottleneck if called with large size values.

Consider documenting the expected use case (e.g., small compile-time-known sizes) or adding a warp-parallel variant if performance becomes an issue.

tilelang/jit/adapter/wrapper.py (1)

350-351: Consider adding error handling for edge cases.

The implementation correctly extracts CUDA kernel declarations by splitting at the first semicolon. However, if the semicolon is missing or appears in an unexpected context (e.g., inside a comment or string), the method returns the entire input string, which may not be the intended behavior.

💡 Optional: Add validation or use more robust parsing
 def get_declaration(self, declare_kernel_code: str) -> str:
-    return declare_kernel_code.split(";")[0]
+    parts = declare_kernel_code.split(";", 1)
+    if len(parts) < 2:
+        logger.warning(f"Expected semicolon in kernel declaration, got: {declare_kernel_code[:100]}")
+    return parts[0]
testing/python/language/test_tilelang_language_atomic_add.py (1)

214-215: Align tensor initialization with the standard pattern.

The two-step initialization (create as float32 → convert to dtype) differs from the direct-creation pattern used consistently throughout this file. Compare with lines 30-31, 63-64, 98-99, etc.: torch.randn(..., dtype=getattr(torch, dtype)).cuda().

🔎 Refactor to match standard pattern
-    A = torch.randn(M, N, dtype=torch.float32).cuda().to(getattr(torch, dtype))
-    B = torch.zeros(M, N, dtype=torch.float32).cuda().to(getattr(torch, dtype))
+    A = torch.randn(M, N, dtype=getattr(torch, dtype)).cuda()
+    B = torch.zeros(M, N, dtype=getattr(torch, dtype)).cuda()

If the two-step conversion is intentional (e.g., to match specific numerical behavior), add a brief comment explaining why.

testing/python/kernel/test_tilelang_kernel_gemm.py (1)

255-255: Consistent with f16 accumulation gating pattern.

These padding tests follow the same gating pattern as other f16f16f16 tests, while test_pad_f16f16f32_nn (line 291) remains ungated. This confirms the ROCm issue is related to f16 accumulation dtype rather than padding logic.

Consider consolidating the ROCm compatibility constraints into documentation or a decision matrix showing:

  • Which dtype combinations work on CUDA vs ROCm
  • Whether the limitation is temporary (bugs to fix) or architectural (unsupported operations)

Also applies to: 273-273

testing/python/language/test_tilelang_language_mask_op.py (1)

37-141: Consider adding CUDA availability checks for consistency with related tests.

While most language tests don't include GPU checks, test_tilelang_language_parallel.py (in the same directory) includes a pattern for gracefully skipping when CUDA is unavailable:

if not torch.cuda.is_available():
    pytest.skip("CUDA not available")

Since these tests use device="cuda" and will fail on non-GPU systems, consider adding similar checks to the test functions for consistency.

testing/python/language/test_tilelang_language_annotate_safe_value.py (1)

32-32: Consider parameterizing the device for multi-backend testing.

While ROCm's PyTorch typically supports device="cuda" for compatibility, explicitly using a hardcoded CUDA device string is conceptually inconsistent with removing the explicit target="cuda" from the compile call above. For clearer multi-backend support, consider parameterizing the device based on available hardware.
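
As a rough illustration (the make_input helper below is hypothetical, not part of the test file), the device could be threaded through tensor creation rather than repeated at each call site:

import torch

def make_input(M: int, N: int, dtype: torch.dtype = torch.float16,
               device: str = "cuda") -> torch.Tensor:
    # Thread the device through instead of hardcoding "cuda" at each call site;
    # ROCm builds of PyTorch also accept the "cuda" device string, so the
    # default still works on AMD GPUs.
    return torch.randn(M, N, dtype=dtype, device=device)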

testing/python/language/test_tilelang_language_alloc.py (1)

154-156: Consider more robust initializer assertions.

The kernel_only=True parameter and exact count assertions improve test precision. However, the current assertions code.count("= 1;") and code.count("= 2;") could fail if other unrelated assignments with these values appear in the generated kernel code.

🔎 Suggested improvement for more robust assertions

Consider using more specific patterns that capture the variable name context:

 kernel = tilelang.compile(program, out_idx=[1])
 code = kernel.get_kernel_source(kernel_only=True)
-assert code.count("= 1;") == 1
-assert code.count("= 2;") == 1
+import re
+# Match variable initialization patterns more precisely
+assert len(re.findall(r'\btmp0\s*=\s*1;', code)) == 1
+assert len(re.findall(r'\btmp1\s*=\s*2;', code)) == 1

This ensures we're specifically checking the tmp0 and tmp1 initializations rather than any occurrence of = 1; or = 2;.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 16c4dc4 and 59637e8.

📒 Files selected for processing (54)
  • .github/workflows/ci.yml
  • src/op/gemm.cc
  • src/op/logical.cc
  • src/target/codegen_hip.cc
  • src/tl_templates/cuda/reduce.h
  • src/tl_templates/hip/atomic.h
  • src/tl_templates/hip/common.h
  • src/tl_templates/hip/debug.h
  • src/tl_templates/hip/hip_fp8.h
  • src/tl_templates/hip/reduce.h
  • testing/python/autotune/test_tilelang_autotune.py
  • testing/python/cache/test_tilelang_kernel_cache.py
  • testing/python/carver/test_tilelang_carver_cuda_driver_properties.py
  • testing/python/carver/test_tilelang_carver_recommend_hints.py
  • testing/python/components/test_storage_rewrite_detect_inplace.py
  • testing/python/debug/test_device_assert.py
  • testing/python/debug/test_tilelang_debug_print.py
  • testing/python/issue/test_tilelang_issue_1001.py
  • testing/python/issue/test_tilelang_issue_1008.py
  • testing/python/issue/test_tilelang_issue_830.py
  • testing/python/issue/test_tilelang_issue_96.py
  • testing/python/jit/test_tilelang_jit_callback.py
  • testing/python/jit/test_tilelang_jit_cutedsl.py
  • testing/python/jit/test_tilelang_jit_gemm.py
  • testing/python/jit/test_tilelang_jit_gemm_cython.py
  • testing/python/jit/test_tilelang_jit_nvrtc.py
  • testing/python/jit/test_tilelang_jit_parcompile.py
  • testing/python/jit/test_tilelang_jit_tvm_ffi.py
  • testing/python/kernel/test_tilelang_kernel_gemm.py
  • testing/python/kernel/test_tilelang_kernel_gemm_simt.py
  • testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py
  • testing/python/language/test_tilelang_language_alias.py
  • testing/python/language/test_tilelang_language_alloc.py
  • testing/python/language/test_tilelang_language_annot.py
  • testing/python/language/test_tilelang_language_annotate_safe_value.py
  • testing/python/language/test_tilelang_language_atomic_add.py
  • testing/python/language/test_tilelang_language_clear.py
  • testing/python/language/test_tilelang_language_composable_index.py
  • testing/python/language/test_tilelang_language_copy.py
  • testing/python/language/test_tilelang_language_frontend_v2.py
  • testing/python/language/test_tilelang_language_infinity.py
  • testing/python/language/test_tilelang_language_let.py
  • testing/python/language/test_tilelang_language_mask_op.py
  • testing/python/language/test_tilelang_language_ptr.py
  • testing/python/language/test_tilelang_language_unroll.py
  • testing/python/language/test_tilelang_language_var_init.py
  • testing/python/language/test_tilelang_language_vectorize.py
  • testing/python/language/test_tilelang_language_vectorized_cast.py
  • testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py
  • testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py
  • tilelang/engine/lower.py
  • tilelang/intrinsics/mfma_layout.py
  • tilelang/intrinsics/mfma_macro_generator.py
  • tilelang/jit/adapter/wrapper.py
💤 Files with no reviewable changes (3)
  • testing/python/issue/test_tilelang_issue_830.py
  • testing/python/language/test_tilelang_language_copy.py
  • testing/python/language/test_tilelang_language_composable_index.py
🚧 Files skipped from review as they are similar to previous changes (25)
  • testing/python/issue/test_tilelang_issue_96.py
  • testing/python/components/test_storage_rewrite_detect_inplace.py
  • testing/python/language/test_tilelang_language_clear.py
  • testing/python/carver/test_tilelang_carver_recommend_hints.py
  • tilelang/intrinsics/mfma_macro_generator.py
  • src/op/logical.cc
  • testing/python/debug/test_tilelang_debug_print.py
  • testing/python/issue/test_tilelang_issue_1001.py
  • testing/python/issue/test_tilelang_issue_1008.py
  • testing/python/language/test_tilelang_language_frontend_v2.py
  • testing/python/jit/test_tilelang_jit_parcompile.py
  • testing/python/language/test_tilelang_language_let.py
  • testing/python/language/test_tilelang_language_var_init.py
  • testing/python/language/test_tilelang_language_annot.py
  • testing/python/language/test_tilelang_language_infinity.py
  • testing/python/kernel/test_tilelang_kernel_gemm_simt.py
  • src/target/codegen_hip.cc
  • src/op/gemm.cc
  • testing/python/language/test_tilelang_language_ptr.py
  • testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py
  • testing/python/debug/test_device_assert.py
  • tilelang/engine/lower.py
  • src/tl_templates/hip/hip_fp8.h
  • testing/python/language/test_tilelang_language_vectorized_cast.py
  • testing/python/carver/test_tilelang_carver_cuda_driver_properties.py
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.

Applied to files:

  • testing/python/language/test_tilelang_language_annotate_safe_value.py
  • testing/python/language/test_tilelang_language_mask_op.py
  • testing/python/autotune/test_tilelang_autotune.py
  • testing/python/jit/test_tilelang_jit_cutedsl.py
  • testing/python/jit/test_tilelang_jit_nvrtc.py
  • testing/python/language/test_tilelang_language_alloc.py
  • testing/python/language/test_tilelang_language_unroll.py
  • testing/python/language/test_tilelang_language_vectorize.py
  • testing/python/jit/test_tilelang_jit_callback.py
  • testing/python/jit/test_tilelang_jit_tvm_ffi.py
  • testing/python/language/test_tilelang_language_alias.py
  • testing/python/jit/test_tilelang_jit_gemm.py
  • testing/python/cache/test_tilelang_kernel_cache.py
  • testing/python/kernel/test_tilelang_kernel_gemm.py
  • testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.

Applied to files:

  • testing/python/language/test_tilelang_language_annotate_safe_value.py
  • testing/python/language/test_tilelang_language_mask_op.py
  • testing/python/language/test_tilelang_language_alloc.py
  • testing/python/language/test_tilelang_language_unroll.py
  • testing/python/language/test_tilelang_language_atomic_add.py
  • testing/python/language/test_tilelang_language_alias.py
  • testing/python/kernel/test_tilelang_kernel_gemm.py
  • testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py
  • testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py
📚 Learning: 2025-12-26T06:45:51.789Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1483
File: tilelang/jit/adapter/cutedsl/adapter.py:93-95
Timestamp: 2025-12-26T06:45:51.789Z
Learning: For the CuTeDSL backend in tilelang/jit/adapter/cutedsl/adapter.py, the host_kernel_source and device_kernel_source have the same value.

Applied to files:

  • testing/python/jit/test_tilelang_jit_cutedsl.py
📚 Learning: 2025-09-15T10:51:06.985Z
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h.

Applied to files:

  • src/tl_templates/hip/reduce.h
📚 Learning: 2025-09-15T10:51:06.985Z
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). However, the layout template parameter should use sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h where Crosswise uses sizeof(ElementA) but the layout template uses sizeof_bits<ElementA>::value.

Applied to files:

  • src/tl_templates/hip/reduce.h
🧬 Code graph analysis (11)
testing/python/language/test_tilelang_language_annotate_safe_value.py (1)
tilelang/jit/kernel.py (1)
  • out_idx (609-610)
testing/python/jit/test_tilelang_jit_gemm_cython.py (2)
tilelang/language/v2/dtypes.py (2)
  • float32 (300-300)
  • float16 (299-299)
tilelang/language/symbolics.py (1)
  • dynamic (12-29)
testing/python/language/test_tilelang_language_mask_op.py (1)
tilelang/jit/__init__.py (2)
  • compile (47-107)
  • compile (347-373)
src/tl_templates/hip/reduce.h (1)
src/tl_templates/hip/common.h (13)
  • tl (110-200)
  • shfl_down (138-140)
  • shfl_down (157-161)
  • shfl_down (182-186)
  • shfl_xor (134-136)
  • shfl_xor (151-155)
  • shfl_xor (176-180)
  • shfl (146-148)
  • shfl (169-173)
  • shfl (194-198)
  • shfl_up (142-144)
  • shfl_up (163-167)
  • shfl_up (188-192)
testing/python/jit/test_tilelang_jit_callback.py (2)
testing/python/cache/test_tilelang_kernel_cache.py (1)
  • callback (57-60)
tilelang/engine/callback.py (2)
  • register_cuda_postproc_callback (44-74)
  • register_hip_postproc_callback (77-107)
testing/python/jit/test_tilelang_jit_tvm_ffi.py (2)
tilelang/language/v2/dtypes.py (2)
  • float32 (300-300)
  • float16 (299-299)
tilelang/language/symbolics.py (1)
  • dynamic (12-29)
testing/python/jit/test_tilelang_jit_gemm.py (1)
tilelang/language/v2/dtypes.py (1)
  • float32 (300-300)
src/tl_templates/hip/common.h (3)
src/tl_templates/cuda/reduce.h (3)
  • tl (10-83)
  • half_t (17-19)
  • bfloat16_t (20-22)
src/tl_templates/cuda/common.h (5)
  • tl (248-342)
  • tl (571-573)
  • tl (574-576)
  • tl (591-671)
  • bfloat16_t (581-582)
src/tl_templates/cuda/atomic.h (1)
  • half_t (24-26)
src/tl_templates/hip/debug.h (2)
tilelang/language/print_op.py (1)
  • print_var (15-25)
tilelang/language/v2/dtypes.py (2)
  • short (235-235)
  • long (238-238)
testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py (2)
testing/python/amd/test_tilelang_test_amd.py (1)
  • run_gemm_rs (195-239)
maint/gemm_v2/correctness_evaluation.py (3)
  • run_gemm_rs (178-210)
  • run_gemm_sr (262-295)
  • run_gemm_rr (350-383)
src/tl_templates/cuda/reduce.h (1)
src/tl_templates/hip/reduce.h (2)
  • run (121-181)
  • run (188-259)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
🔇 Additional comments (54)
testing/python/jit/test_tilelang_jit_gemm_cython.py (2)

169-169: LGTM: Improved numerical precision.

Upgrading the accumulator dtype to T.float32 while keeping fp16 inputs/outputs improves numerical stability by reducing accumulation errors during matrix multiplication. This aligns with industry best practices for GEMM operations.


211-211: LGTM: Consistent precision upgrades across all test variants.

The accumulator dtype has been consistently upgraded to T.float32 across all test scenarios (do_bench, multi_stream, dynamic_shape variants). This ensures uniform numerical behavior and improved accuracy throughout the test suite.

Also applies to: 255-255, 304-308, 357-357

testing/python/jit/test_tilelang_jit_gemm.py (1)

106-106: LGTM: Test name and configuration properly updated.

The test function has been appropriately renamed to test_gemm_f16f16f32_nn_kernel_jit to reflect the dtype configuration (fp16 inputs, fp16 outputs, fp32 accumulation), and the accumulator dtype parameter has been consistently updated to T.float32. This change aligns with the precision upgrades across the test suite.

Also applies to: 115-115

testing/python/language/test_tilelang_language_alias.py (1)

48-48: This follows the standard codebase pattern; no additional verification needed for automatic target selection.

The compile call without explicit target parameter is consistent with the existing codebase convention. Most tests in the repository rely on automatic target detection, so this change aligns with standard practice rather than introducing a new mechanism. The test will work with the default target available in the environment.

testing/python/language/test_tilelang_language_unroll.py (3)

20-21: Good addition of TODO comment and CUDA-gating decorator.

The TODO comment clearly documents the HIP limitation for unroll factor, and the @tilelang.testing.requires_cuda decorator appropriately gates this test to CUDA-only environments. This is consistent with the PR's goal of enabling ROCm CI while properly handling platform-specific limitations.


32-33: Consistent removal of explicit target parameter.

The removal of the explicit target="cuda" parameter is consistent with test_unroll_with_step (line 16) and aligns with the broader PR pattern of relying on environment-based default targets. Since this test is gated with @tilelang.testing.requires_cuda, the default target will correctly resolve to CUDA when the test executes.


16-17: No changes needed. The test design is correct as is.

The basic unroll with step (line 16) is intentionally not gated to CUDA because it is supported on both CUDA and HIP. HIP codegen does generate #pragma unroll for unrolled loops, making the assertion valid on both platforms. The unroll_factor variant (line 22) is correctly gated to CUDA-only because HIP does not support the #pragma unroll <N> form—as documented in the TODO comment on line 20.

.github/workflows/ci.yml (1)

387-388: LGTM: Temporary test exclusion for ROCm.

The changes appropriately exclude the runtime and transform test directories from ROCm CI runs while they're being repaired, and broaden test discovery from ./python/amd to ./python with explicit ignores. This is a reasonable temporary solution.

testing/python/language/test_tilelang_language_vectorize.py (1)

127-127: LGTM: Appropriate CUDA gating for vectorization test.

The decorator correctly gates this vectorization test to CUDA-enabled environments, aligning with the PR's objective to partition tests by backend capability.

testing/python/autotune/test_tilelang_autotune.py (3)

262-262: LGTM: CUDA gating for autotuner configuration test.

The decorator appropriately restricts this test to CUDA-enabled environments, consistent with the broader PR pattern.


268-268: LGTM: CUDA gating for autotuner matmul test.

The decorator correctly restricts this test to CUDA-enabled environments.


250-254: Removal of explicit target="auto" is safe and preserves behavior.

The target parameter defaults to None in set_compile_args, which triggers fallback to the TILELANG_TARGET environment variable that itself defaults to "auto". Removing the explicit target="auto" parameter does not change behavior—it relies on the same default mechanism that produces identical results. The test now uses the environment variable defaults instead of explicit parameter passing, both of which resolve to the same target selection strategy.

Note: The example file examples/gemm/example_gemm_autotune.py still explicitly specifies target="auto" at line 153, maintaining consistency with the documented pattern, though both approaches are functionally equivalent.

testing/python/tilelibrary/test_tilelang_tilelibrary_gemm_sp_v2.py (4)

156-156: LGTM: CUDA gating for sparse GEMM test.

The decorator appropriately restricts the test_gemm_ss sparse GEMM test to CUDA-enabled environments.


307-307: LGTM: CUDA gating for register-shared sparse GEMM test.

The decorator appropriately restricts the test_gemm_rs test to CUDA-enabled environments.


458-458: LGTM: CUDA gating for shared-register sparse GEMM test.

The decorator appropriately restricts the test_gemm_sr test to CUDA-enabled environments.


613-613: LGTM: CUDA gating for register-register sparse GEMM test.

The decorator appropriately restricts the test_gemm_rr test to CUDA-enabled environments.

testing/python/jit/test_tilelang_jit_cutedsl.py (4)

158-158: LGTM: CUDA gating for CuTeDSL JIT kernel test.

The decorator appropriately restricts this CuTeDSL compilation test to CUDA-enabled environments, which is necessary since CuTeDSL is a CUDA-specific backend.


210-210: LGTM: CUDA gating for CuTeDSL benchmarking test.

The decorator correctly restricts the benchmarking test to CUDA-enabled environments.


253-253: LGTM: CUDA gating for CuTeDSL multi-stream test.

The decorator appropriately restricts the multi-stream test to CUDA-enabled environments.


303-303: LGTM: CUDA gating for CuTeDSL dynamic shape test.

The decorator correctly restricts the dynamic shape test to CUDA-enabled environments.

testing/python/cache/test_tilelang_kernel_cache.py (1)

121-121: LGTM! CUDA decorator is appropriate here.

This test executes GPU kernels (lines 190-191) and creates CUDA tensors (lines 187-188), so the @tilelang.testing.requires_cuda decorator is correctly applied.

testing/python/jit/test_tilelang_jit_nvrtc.py (1)

159-159: LGTM! CUDA decorators are appropriate for NVRTC tests.

All six tests use the NVRTC (NVIDIA Runtime Compiler) backend which is CUDA-specific. The tests also create CUDA tensors and use CUDA-specific APIs, so the @tilelang.testing.requires_cuda decorators are correctly applied.

Also applies to: 211-211, 254-254, 304-304, 376-376, 390-390

testing/python/tilelibrary/test_tilelang_tilelibrary_gemm.py (4)

114-118: LGTM! Improved accumulator precision for float16 inputs.

Changing dtypeAccum from T.float16 to T.float32 improves numerical stability and reduces accumulation errors in GEMM operations, which is a best practice for mixed-precision computation.


271-275: LGTM! Improved accumulator precision for float16 inputs.

Changing dtypeAccum from T.float16 to T.float32 improves numerical stability, consistent with the changes in other GEMM variants.


427-431: LGTM! Improved accumulator precision for float16 inputs.

Changing dtypeAccum from T.float16 to T.float32 improves numerical stability, consistent with other GEMM variants.


590-593: LGTM! Improved accumulator precision for float16 inputs.

Changing dtypeAccum from T.float16 to T.float32 improves numerical stability, consistent with other GEMM variants.

src/tl_templates/hip/debug.h (6)

6-21: LGTM: Clean fallback implementation.

The primary template provides a safe fallback by printing the address for unknown types, maintaining a consistent interface for both print_var and print_buffer.


60-74: LGTM: Clear bool representation.

The bool specialization correctly prints "true" or "false" strings instead of numeric values, making debug output more readable.


76-90: LGTM: Safe pointer printing.

The pointer specialization correctly uses %p format and casts to void*, following best practices for printing pointer addresses.


92-101: LGTM: Clean delegation to trait system.

The public interface functions correctly delegate to the trait system, maintaining a clean separation between the API and implementation details.


53-54: Types half_t and bfloat16_t are properly defined and support float conversion.

Both types are correctly aliased in src/tl_templates/hip/common.h: half_t maps to float16_t and bfloat16_t maps to hip_bfloat16. These are standard HIP types with implicit conversion to float, making the print trait definitions valid.


4-4: The FP8 type definitions in hip_fp8.h are properly defined. Both fp8_e4_t and fp8_e5_t are correctly aliased to HIP's native FP8 types (__hip_fp8_e4m3_fnuz and __hip_fp8_e5m2_fnuz respectively from <hip/amd_detail/amd_hip_fp8.h>), which natively support conversion to float. The macro specializations at lines 56-57 in debug.h will work correctly with these types.

src/tl_templates/cuda/reduce.h (1)

177-186: LGTM: Correct signature and indexing fixes.

The changes properly:

  1. Update CumSum2D::run to return void since results are written directly to dst
  2. Use the SEG template parameter for lane and row computation instead of hardcoded 32, ensuring consistency with the segment size

Note: The past review comment about src/op/reduce.cc line 570 using dst->dtype instead of DataType::Handle() for the call to this void-returning function remains relevant and should be addressed separately.

src/tl_templates/hip/atomic.h (1)

64-104: LGTM: Vectorized atomics match expected per-element semantics.

The AtomicAddx2/x4 and AtomicAddx2Ret/x4Ret implementations correctly perform per-element atomic operations. The *Ret variants return the old values from each individual atomic, which is the expected behavior (not a single atomic operation on the entire vector).

src/tl_templates/hip/common.h (1)

132-148: LGTM: HIP shuffle wrappers with noted sync limitation.

The TODO on line 132 correctly notes that ROCm 7.1.1+ will provide shfl_sync variants. The current implementation using non-sync HIP intrinsics is appropriate for current ROCm versions.

The generic templates delegate to raw HIP intrinsics, with specializations (lines 151-198) handling half_t and bfloat16_t via float conversion, matching the pattern in the CUDA counterpart.

src/tl_templates/hip/reduce.h (5)

76-77: LGTM: Correct use of type-safe shuffle wrapper.

Using tl::shfl_down instead of the raw __shfl_down intrinsic ensures proper handling of half_t and bfloat16_t types through the float-conversion specializations defined in common.h.


107-108: LGTM: Consistent use of tl::shfl_xor wrapper.


117-182: LGTM: CumSum1D implementation correctly uses HIP warp size.

The implementation properly:

  • Uses SEG=64 default matching AMD's warp/wavefront size
  • Employs tl::shfl_* wrappers throughout for type safety
  • Handles both forward and reverse prefix-sum modes
  • Early-exits threads beyond SEG (lines 129-130), since only one warp performs the scan

184-260: LGTM: CumSum2D implementation with axis-aware indexing.

The implementation correctly:

  • Supports cumulative sum along either axis via the Axis template parameter
  • Uses proper real_row/real_col mapping based on axis selection
  • Employs tl::shfl_* wrappers for type-safe shuffling
  • Handles boundary conditions and early exits appropriately

273-291: LGTM: Convenience wrappers for common reduction operations.

These wrappers correctly delegate to warp_reduce with the appropriate operator. They will function correctly once the underlying warp_reduce issue (using tl::shfl_xor) is addressed.

testing/python/jit/test_tilelang_jit_callback.py (2)

4-4: LGTM! Import addition mirrors the CUDA callback pattern.

The import of register_hip_postproc_callback is necessary for HIP/ROCm support and follows the existing pattern for CUDA callbacks.


93-96: HIP callback correctly mirrors the CUDA callback pattern.

The HIP post-processing callback is implemented consistently with the CUDA callback at lines 88-91. Both callbacks are registered in the same scope and will coexist without conflict (the runtime determines which is invoked based on the target).

Note: test_cuda_postproc_callback is skipped (line 107), and test_gemm_jit_kernel doesn't verify callback behavior—it only tests computational correctness. Consider enabling or adding tests that verify the HIP callback is invoked on ROCm hardware.
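
A rough sketch of such a verification, assuming register_hip_postproc_callback can be applied as a decorator and the callback takes (code, target) and returns the processed source, mirroring the CUDA callback used earlier in this file:

from tilelang.engine.callback import register_hip_postproc_callback

invoked = {"hip": False}

@register_hip_postproc_callback
def _record_hip_postproc(code, target):
    # Mark that HIP post-processing ran and return the source unchanged so the
    # kernel still builds normally.
    invoked["hip"] = True
    return code

# After compiling a kernel for a ROCm target, the test could assert
# invoked["hip"] to confirm the HIP callback path was actually exercised.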

tilelang/intrinsics/mfma_layout.py (3)

3-3: LGTM!

The import addition is necessary for the const(0) usage in the layout functions.


18-20: Same dtype verification applies here.

This change mirrors line 9. Ensure the default int32 dtype for const(0) is appropriate for ROCm compilation.


7-9: This change is intentional for HIP/ROCm compatibility and requires no modifications.

The use of const(0) was introduced specifically to fix HIP build issues. The default int32 dtype is appropriate for layout index calculations, where both thread_id (computed from arithmetic) and the constant offset are int32-compatible. The consistent pattern across both functions confirms this is the correct approach for ROCm targets.

tilelang/jit/adapter/wrapper.py (1)

653-656: LGTM! HIP-specific declaration extraction is well-designed.

The override correctly handles HIP kernel syntax by splitting at the opening brace instead of a semicolon. The comment clearly explains why this differs from CUDA, and the approach aligns with HIP's inline kernel definition style.

💡 Optional: Add defensive check

For added robustness, you could verify the "{" exists:

 def get_declaration(self, declare_kernel_code: str) -> str:
     # HIP code dont have function declaration, so we use '{\n' to split
     # __global__ void __launch_bounds__(128) kernel_kernel(float* __restrict__ A) {\n
-    return declare_kernel_code.split("{")[0]
+    parts = declare_kernel_code.split("{", 1)
+    if len(parts) < 2:
+        logger.warning(f"Expected opening brace in HIP kernel, got: {declare_kernel_code[:100]}")
+    return parts[0]
testing/python/language/test_tilelang_language_atomic_add.py (4)

210-211: LGTM! Consistent dtype parameterization.

The dtype parameter propagation follows the established pattern used in other test runner functions throughout the file.


238-238: LGTM! Appropriate CUDA gating added.

The @requires_cuda decorators correctly gate CUDA-specific atomic operations, preventing test failures on non-CUDA platforms.

Also applies to: 243-243


248-250: LGTM! Well-formed half-precision test.

The test correctly exercises atomic_addx2 with float16, includes the required @requires_cuda decorator, and follows the established test pattern.


353-353: LGTM! Proper CUDA gating for memory-order tests.

The decorator correctly gates CUDA-specific atomic memory ordering operations.

testing/python/language/test_tilelang_language_annotate_safe_value.py (1)

31-31: Default target selection is correctly implemented. The removal of explicit target="cuda" is appropriate—the tilelang.compile() function uses environment variable TILELANG_TARGET (defaults to "auto"), which triggers determine_target() to intelligently detect available backends (CUDA, then HIP, then Metal). PyTorch's device="cuda" is compatible with ROCm systems via the CUDA compatibility layer, and @tilelang.testing.requires_cuda from TVM's testing utilities works correctly on both CUDA and ROCm. The code properly supports multi-backend compilation as intended.

testing/python/jit/test_tilelang_jit_tvm_ffi.py (1)

168-168: LGTM: Accumulation dtype updated to float32 for numerical stability.

The change from T.float16 to T.float32 for dtypeAccum across all test invocations aligns with the reference implementation which uses torch.float (float32) for matmul computation. Accumulating in higher precision is a standard practice to avoid numerical errors.

Also applies to: 211-211, 253-253, 302-307

testing/python/language/test_tilelang_language_alloc.py (2)

118-121: LGTM - Appropriate CUDA gating for unsupported ROCm tests.

The @tilelang.testing.requires_cuda decorator correctly restricts these tests to CUDA until ROCm support is implemented. The TODO comment clearly documents the temporary limitation.


36-36: Both "tmp =" and "tmp[0] =" are valid code generation patterns for single-element buffers. Since alloc_var creates a buffer with shape [1], different backend optimization passes may generate either the scalar form or the explicit array indexing form. The OR assertion correctly accommodates this legitimate variation and is appropriate for supporting multiple backends with different codegen strategies.

T.float16,
T.float16,
T.float16,
T.float32,
Contributor

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the test file to understand the context
cat -n testing/python/jit/test_tilelang_jit_callback.py | head -150

Repository: tile-ai/tilelang

Length of output: 4675


🏁 Script executed:

# Also check around line 232
cat -n testing/python/jit/test_tilelang_jit_callback.py | sed -n '220,240p'

Repository: tile-ai/tilelang

Length of output: 571


🏁 Script executed:

# Search for float16 accumulation patterns in the codebase
rg -n 'accum_dtype.*float16' testing/ --type py

Repository: tile-ai/tilelang

Length of output: 549


🏁 Script executed:

# Search for float16 in gemm or matmul related tests
rg -n -B 5 -A 5 'float16.*float16.*float16' testing/ --type py | head -100

Repository: tile-ai/tilelang

Length of output: 8666


🏁 Script executed:

# Look for ROCm-related comments or issues mentioning dtype precision
rg -n -i 'rocm.*precision|rocm.*dtype|rocm.*float16|rocm.*float32|mfma.*precision|fp8.*accumul' . --type py --type md --type txt | head -50

Repository: tile-ai/tilelang

Length of output: 883


The accumulation dtype change to float32 is inconsistent with other GEMM tests in the codebase.

This change is specific to these two test functions, while other GEMM tests (e.g., test_tilelang_kernel_gemm_mma_intrinsic.py, test_tilelang_jit_nvrtc.py, test_tilelang_jit_cutedsl.py) continue to use float16 for accumulation. The active test test_gemm_jit_kernel uses loose tolerances (atol=1e-2, rtol=1e-2, max_mismatched_ratio=0.05), suggesting this change addresses a numerical precision issue specific to these tests.

Clarify:

  1. Is this change specific to callback/JIT compilation mechanisms, or should similar updates apply to other GEMM test variants?
  2. If callback/JIT-specific, document why float32 accumulation is required for these paths.
🤖 Prompt for AI Agents
In testing/python/jit/test_tilelang_jit_callback.py around line 117, the
accumulation dtype was changed to T.float32 which is inconsistent with other
GEMM tests that use float16; either revert the accumulation dtype back to
float16 to match the rest of the GEMM test suite, or if float32 is required for
callback/JIT paths, add a concise comment above the test (or a docstring)
explaining that this path needs higher-precision accumulation, why (e.g.,
numerical instability in callback/JIT compilation), and adjust the test metadata
(tolerance or a targeted marker) so the deviation is explicit; after making the
chosen change, run the related GEMM tests to confirm behavior is consistent.

)


@tilelang.testing.requires_cuda
Contributor

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's look at line 387 and surrounding context in the test file
sed -n '380,400p' testing/python/jit/test_tilelang_jit_tvm_ffi.py

Repository: tile-ai/tilelang

Length of output: 666


🏁 Script executed:

# Let's also check the entire structure of test functions in the file
rg -n '^\s*def test_' testing/python/jit/test_tilelang_jit_tvm_ffi.py | head -20

Repository: tile-ai/tilelang

Length of output: 323


🏁 Script executed:

# Check what decorators are applied to test functions
rg -n -B 2 'def test_' testing/python/jit/test_tilelang_jit_tvm_ffi.py | grep -E '(@|def test_)' | head -30

Repository: tile-ai/tilelang

Length of output: 359


🏁 Script executed:

# Check test_gemm_jit_kernel (starting at line 159)
sed -n '159,209p' testing/python/jit/test_tilelang_jit_tvm_ffi.py

Repository: tile-ai/tilelang

Length of output: 1092


🏁 Script executed:

# Check test_tvm_ffi_kernel_do_bench (starting at line 210)
sed -n '210,251p' testing/python/jit/test_tilelang_jit_tvm_ffi.py

Repository: tile-ai/tilelang

Length of output: 1223


🏁 Script executed:

# Check test_tvm_ffi_kernel_multi_stream (starting at line 252)
sed -n '252,300p' testing/python/jit/test_tilelang_jit_tvm_ffi.py

Repository: tile-ai/tilelang

Length of output: 1438


🏁 Script executed:

# Check test_tvm_ffi_dynamic_shape (starting at line 301)
sed -n '301,373p' testing/python/jit/test_tilelang_jit_tvm_ffi.py

Repository: tile-ai/tilelang

Length of output: 3152


🏁 Script executed:

# Check test_tvm_ffi_im2col_tma_desc (starting at line 374)
sed -n '374,387p' testing/python/jit/test_tilelang_jit_tvm_ffi.py

Repository: tile-ai/tilelang

Length of output: 523


🏁 Script executed:

# Check test_tvm_ffi_pdl (starting at line 455)
sed -n '455,500p' testing/python/jit/test_tilelang_jit_tvm_ffi.py

Repository: tile-ai/tilelang

Length of output: 1417


🏁 Script executed:

# Find run_gemm_jit_kernel function
rg -n "def run_gemm_jit_kernel" testing/python/jit/test_tilelang_jit_tvm_ffi.py

Repository: tile-ai/tilelang

Length of output: 88


🏁 Script executed:

# Check the beginning of the file for imports and setup
head -100 testing/python/jit/test_tilelang_jit_tvm_ffi.py

Repository: tile-ai/tilelang

Length of output: 3240


🏁 Script executed:

# Look for pytest.skip or skip mechanisms at test start
rg -n "pytest.skip|pytest.mark.skipif" testing/python/jit/test_tilelang_jit_tvm_ffi.py

Repository: tile-ai/tilelang

Length of output: 205


🏁 Script executed:

# Check the run_gemm_jit_kernel function
sed -n '102,158p' testing/python/jit/test_tilelang_jit_tvm_ffi.py

Repository: tile-ai/tilelang

Length of output: 1165


🏁 Script executed:

# Check if there are any global CUDA checks at the module level
grep -n "cuda\|requires_cuda" testing/python/jit/test_tilelang_jit_tvm_ffi.py | head -20

Repository: tile-ai/tilelang

Length of output: 1252


🏁 Script executed:

# Check the file structure - look for conftest or any module-level skips
head -150 testing/python/jit/test_tilelang_jit_tvm_ffi.py | tail -50

Repository: tile-ai/tilelang

Length of output: 988


🏁 Script executed:

# Check if test_tvm_ffi_kernel_do_bench uses CUDA-specific features through profiler
sed -n '210,251p' testing/python/jit/test_tilelang_jit_tvm_ffi.py | grep -E "(cuda|\.cuda\(\))"

Repository: tile-ai/tilelang

Length of output: 305


🏁 Script executed:

# Verify what decorators are on test_gemm_jit_kernel and test_tvm_ffi_kernel_do_bench
sed -n '155,165p' testing/python/jit/test_tilelang_jit_tvm_ffi.py

Repository: tile-ai/tilelang

Length of output: 284


🏁 Script executed:

# Check if there's a conftest.py that might skip these tests globally
find testing/python/jit -name "conftest.py" -exec cat {} \;

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

# Check what profiler is and whether do_bench requires CUDA
rg -n "get_profiler\|do_bench" testing/python/jit/test_tilelang_jit_tvm_ffi.py -B 2 -A 2

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

# Check if tvm_ffi backend execution happens in test_tvm_ffi_kernel_do_bench
sed -n '225,251p' testing/python/jit/test_tilelang_jit_tvm_ffi.py

Repository: tile-ai/tilelang

Length of output: 789


🏁 Script executed:

# Check the full test_tvm_ffi_kernel_do_bench to see if profiler execution happens
sed -n '210,251p' testing/python/jit/test_tilelang_jit_tvm_ffi.py

Repository: tile-ai/tilelang

Length of output: 1223


🏁 Script executed:

# List all test functions and their decorators/gates
rg -B 3 '^\s*def test_' testing/python/jit/test_tilelang_jit_tvm_ffi.py | grep -E '(@|def test_|pytest.skip)' | head -40

Repository: tile-ai/tilelang

Length of output: 327


Add @tilelang.testing.requires_cuda decorator to ungated tests using CUDA APIs.

The @tilelang.testing.requires_cuda decorator is correctly applied to test_tvm_ffi_l2_persistent_map. However, several other tests in this file use CUDA-specific APIs without proper gating:

  • test_gemm_jit_kernel: calls run_gemm_jit_kernel which allocates tensors with .cuda()
  • test_tvm_ffi_kernel_do_bench: calls run_tvm_ffi_kernel_do_bench which allocates tensors with .cuda()
  • test_tvm_ffi_kernel_multi_stream: explicitly uses torch.cuda.Stream() and .cuda()
  • test_tvm_ffi_dynamic_shape: allocates tensors with .cuda()

These tests will fail on non-CUDA systems. Add @tilelang.testing.requires_cuda to each of these test functions to skip them appropriately. Consider also standardizing the other tests (test_tvm_ffi_im2col_tma_desc, test_tvm_ffi_pdl) which currently use custom pytest.skip() gates instead of the decorator.

🤖 Prompt for AI Agents
In testing/python/jit/test_tilelang_jit_tvm_ffi.py around line 387, several
tests that use CUDA APIs are not properly gated: add the
@tilelang.testing.requires_cuda decorator above test_gemm_jit_kernel,
test_tvm_ffi_kernel_do_bench, test_tvm_ffi_kernel_multi_stream, and
test_tvm_ffi_dynamic_shape so they are skipped on non-CUDA systems; also
consider replacing the pytest.skip() guards in test_tvm_ffi_im2col_tma_desc and
test_tvm_ffi_pdl with the same @tilelang.testing.requires_cuda decorator for
consistency.

profiler.assert_allclose(ref_program, atol=1e-2, rtol=1e-2)


@tilelang.testing.requires_cuda
Contributor

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the test file to verify the decorator locations and patterns
cd testing/python/kernel && cat -n test_tilelang_kernel_gemm.py | head -120 | tail -30

Repository: tile-ai/tilelang

Length of output: 176


🏁 Script executed:

# Let's see more of the file to understand the test patterns better
cd testing/python/kernel && sed -n '100,300p' test_tilelang_kernel_gemm.py | cat -n

Repository: tile-ai/tilelang

Length of output: 176


🏁 Script executed:

# Search for ROCm-related decorators, comments, or configuration in the codebase
rg -n 'requires_rocm|@tilelang.testing' --type=py -C2 | head -100

Repository: tile-ai/tilelang

Length of output: 7521


🏁 Script executed:

# Search for any ROCm documentation or issues
rg -n -i 'rocm|amd' --type=py --type=md -C2 | head -80

Repository: tile-ai/tilelang

Length of output: 7918


🏁 Script executed:

# Check the structure of decorators in the tilelang.testing module
fd -e py | xargs rg -l 'requires_cuda|requires_rocm' | head -10

Repository: tile-ai/tilelang

Length of output: 582


🏁 Script executed:

sed -n '100,280p' testing/python/kernel/test_tilelang_kernel_gemm.py | cat -n

Repository: tile-ai/tilelang

Length of output: 4227


🏁 Script executed:

# Check if there's a requires_rocm decorator available and how it's used
rg -n 'def test_gemm' testing/python/kernel/test_tilelang_kernel_gemm.py | head -20

Repository: tile-ai/tilelang

Length of output: 499


🏁 Script executed:

# Look at the complete context of line 237 mentioned in the scratchpad
sed -n '230,245p' testing/python/kernel/test_tilelang_kernel_gemm.py | cat -n

Repository: tile-ai/tilelang

Length of output: 441


🏁 Script executed:

# Check if there are any GitHub issues or documentation about ROCm support
rg -n -i 'github.com|issue|TODO.*rocm' --type=py testing/python/kernel/test_tilelang_kernel_gemm.py

Repository: tile-ai/tilelang

Length of output: 124


🏁 Script executed:

# Check the full PR context and commit message if available
git log --oneline -20 2>/dev/null | head -10

Repository: tile-ai/tilelang

Length of output: 90


🏁 Script executed:

# Look for any related issues or comments about ROCm support
rg -i 'rocm|precision.*issue' testing/python/kernel/test_tilelang_kernel_gemm.py -A2 -B2

Repository: tile-ai/tilelang

Length of output: 184


🏁 Script executed:

# Check if there are more TODO comments about ROCm precision
rg -n 'TODO' testing/python/kernel/test_tilelang_kernel_gemm.py

Repository: tile-ai/tilelang

Length of output: 124


Document ROCm incompatibilities consistently across gated tests.

The @tilelang.testing.requires_cuda decorators on these tests appear intentional due to ROCm precision issues, as evidenced by the explicit TODO at line 237 ("Meets precision issue on ROCm, disable for now"). However, this rationale is documented only for test_gemm_f32f32f32_tn. The other gated tests—test_gemm_f16f16f16_nn, test_gemm_f16f16f16_tn, test_gemm_f16f16f16_nt, test_gemm_f64f64f64_nt, test_pad_aligned_f16f16f16_nn, and test_pad_f16f16f16_nn—lack comments explaining their ROCm incompatibility.

Either:

  1. Add similar TODO comments to the other gated tests if they share the same precision issue, or
  2. Clarify if these tests are gated for different reasons.
🤖 Prompt for AI Agents
In testing/python/kernel/test_tilelang_kernel_gemm.py around line 106, several
tests are gated with @tilelang.testing.requires_cuda but only
test_gemm_f32f32f32_tn has a TODO explaining ROCm precision issues; update the
other gated tests (test_gemm_f16f16f16_nn, test_gemm_f16f16f16_tn,
test_gemm_f16f16f16_nt, test_gemm_f64f64f64_nt, test_pad_aligned_f16f16f16_nn,
test_pad_f16f16f16_nn) by adding the same TODO comment above their
@requires_cuda decorators if they share the ROCm precision problem, or replace
the decorator/comment with a brief note clarifying the actual reason if it
differs.

Comment on lines +237 to 239
# TODO(Gong): Meets precision issue on ROCm, disable for now
@tilelang.testing.requires_cuda
def test_gemm_f32f32f32_tn():
Contributor

⚠️ Potential issue | 🟠 Major

Add issue tracker reference and investigate the transpose-specific precision issue.

The TODO comment indicates a known ROCm precision issue but lacks a tracking reference. Additionally, only the TN (transpose A) variant fails, while test_gemm_f32f32f32_nt (line 221) and test_gemm_f32f32f32_nn (line 156) remain ungated, suggesting the issue is specific to the transpose combination rather than the f32 dtype.

Please:

  1. Create a tracking issue and reference it in the TODO comment (e.g., TODO(Gong): #XXXX ROCm precision issue...)
  2. Document what specific precision failure occurs (numerical error magnitude, wrong results, etc.)
  3. Investigate why only the TN variant fails when NN and NT variants work

Would you like me to generate a script to help identify the precision issue or open a tracking issue?

🤖 Prompt for AI Agents
In testing/python/kernel/test_tilelang_kernel_gemm.py around lines 237-239, the
TODO lacks an issue tracker reference and no details on the ROCm precision
failure that affects only the TN (transpose A) variant; create a tracking issue
in the repo (e.g. #XXXX) and update the TODO to: "TODO(Gong): #XXXX ROCm
precision issue on TN variant — describe observed failure", then run focused
diagnostics capturing failing test outputs (absolute/relative error magnitudes,
mismatched indices/values, and reproduction inputs), add a short comment
summarizing the numeric error (e.g., max abs/rel error and example mismatched
element), and investigate code paths unique to A-transpose (kernel math, memory
layout, stride handling) to identify the root cause or add a targeted test-gate
referencing the new issue if temporary disabling is required.

Comment on lines +43 to 45
@tilelang.testing.requires_cuda
def test_tilelang_copy():
run_tilelang_copy(M=1024, N=1024, block_M=128, block_N=128, pad_value=10)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find the tilelang.testing module and locate requires_cuda decorator
find . -type f -name "*.py" -path "*testing*" | head -20

Repository: tile-ai/tilelang

Length of output: 1181


🏁 Script executed:

#!/bin/bash
# Search for requires_cuda decorator definition
rg -n "def requires_cuda" --type=py -A 15

Repository: tile-ai/tilelang

Length of output: 2432


🏁 Script executed:

#!/bin/bash
# Search for any GPU-related decorators in testing module
rg -n "requires_.*(?:cuda|gpu)" --type=py -A 10

Repository: tile-ai/tilelang

Length of output: 50373


🏁 Script executed:

#!/bin/bash
# Check if there are references to ROCm or HIP in testing utilities
rg -n "rocm|hip|gpu|cuda" --type=py -i -l path="*testing*" | head -10

Repository: tile-ai/tilelang

Length of output: 114


🏁 Script executed:

#!/bin/bash
# Find the requires_cuda function definition
rg -n "^def requires_cuda" tilelang/testing/ -A 20

Repository: tile-ai/tilelang

Length of output: 2671


🏁 Script executed:

#!/bin/bash
# Get the full tilelang/testing/__init__.py file to see all decorators
wc -l tilelang/testing/__init__.py
head -150 tilelang/testing/__init__.py

Repository: tile-ai/tilelang

Length of output: 3963


🏁 Script executed:

#!/bin/bash
# Check if there are any ROCm or HIP-related decorators
rg -n "rocm|hip" tilelang/testing/ --type=py

Repository: tile-ai/tilelang

Length of output: 261


🏁 Script executed:

#!/bin/bash
# Look at the actual test file that was being reviewed
cat -n testing/python/language/test_tilelang_language_annotate_safe_value.py | head -50

Repository: tile-ai/tilelang

Length of output: 2095


Reconcile the decorator with the ROCm CI goals or make the test device-agnostic.

The @tilelang.testing.requires_cuda decorator (from TVM's testing utilities) will skip this test on ROCm systems, even though the PR aims to "Open Rocm ci test". Either:

  1. Use @tilelang.testing.requires_rocm for ROCm execution,
  2. Detect the available GPU device dynamically instead of hardcoding device="cuda" on line 32, or
  3. Use a more generic approach that supports both CUDA and ROCm.

Note that requires_rocm is already available in tilelang.testing (exported from TVM).
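A device-agnostic variant could look like the following sketch. It assumes run_tilelang_copy can take a device argument in place of the hardcoded string; note that PyTorch's ROCm builds expose HIP devices through the same "cuda" device string and the torch.cuda namespace.

    import pytest
    import torch

    def _gpu_device() -> str:
        # ROCm builds of PyTorch reuse the "cuda" device string, so one availability
        # check covers both backends; branch on torch.version.hip only if the test
        # ever needs backend-specific behaviour.
        if not torch.cuda.is_available():
            pytest.skip("no CUDA or ROCm device available")
        return "cuda"

    def test_tilelang_copy():
        run_tilelang_copy(M=1024, N=1024, block_M=128, block_N=128, pad_value=10,
                          device=_gpu_device())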

🤖 Prompt for AI Agents
testing/python/language/test_tilelang_language_annotate_safe_value.py around
lines 43-45: the test is decorated with @tilelang.testing.requires_cuda which
will skip on ROCm; make the test device-agnostic by either replacing the
decorator with a ROCm-aware check or by detecting the available GPU backend at
runtime and using that device for the test. Concretely, remove or replace
@tilelang.testing.requires_cuda with a combined check (e.g., require either CUDA
or ROCm) or use no decorator and instead determine device = "cuda" if CUDA
available else "rocm" (or skip if neither), then pass that device variable into
run_tilelang_copy (also change the hardcoded device="cuda" on line ~32
accordingly) so the test runs on both CUDA and ROCm CI.

Comment on lines +198 to +200
    def atomic_addx2_program(M, N, block_M, block_N, dtype=T.float16):
        @T.prim_func
-       def atomic_addx2(A: T.Tensor((M, N), T.float16), B: T.Tensor((M, N), T.float16)):
+       def atomic_addx2(A: T.Tensor((M, N), dtype), B: T.Tensor((M, N), dtype)):

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check atomic_addx2 dtype constraints and documentation

echo "=== Searching for atomic_addx2 in atomic.py ==="
fd "atomic.py" | head -5

Repository: tile-ai/tilelang

Length of output: 176


🏁 Script executed:

#!/bin/bash
# Find and inspect the atomic.py file
find . -name "atomic.py" -type f | grep -E "tilelang.*atomic\.py"

Repository: tile-ai/tilelang

Length of output: 89


🏁 Script executed:

cat ./tilelang/language/atomic.py

Repository: tile-ai/tilelang

Length of output: 18000


🏁 Script executed:

cat -n ./testing/python/language/test_tilelang_language_atomic_add.py | sed -n '190,260p'

Repository: tile-ai/tilelang

Length of output: 2571


🏁 Script executed:

grep -n "atomic_addx2_float\|atomic_addx2_half\|test_atomic_addx2" ./testing/python/language/test_tilelang_language_atomic_add.py

Repository: tile-ai/tilelang

Length of output: 128


🏁 Script executed:

# Check if atomic_addx2 is used elsewhere with different dtypes
rg "atomic_addx2" --type=py -B2 -A2

Repository: tile-ai/tilelang

Length of output: 5052


Address dtype constraint mismatch between docstring and test.

The atomic_addx2 docstring documents support for FP16/BF16 only (see examples), yet the test file includes test_atomic_addx2_float() which explicitly calls the function with dtype=T.float32. This creates a contradiction: either float32 should be documented as supported, or the test and dtype parameter should be constrained to half-precision types. The implementation currently has no dtype validation to enforce the documented constraints.

Consider one of the following:

  • Update the docstring to document float32 support if it is intentionally supported
  • Restrict the dtype parameter in atomic_addx2_program to only accept FP16/BF16 and move float32 tests to use atomic_addx4 (which is designed for quad-width float32)
  • Add runtime assertions to validate dtype compatibility with the atomic operation
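For the last option, the guard could be as small as a check at the top of the program builder. This is a sketch only, reusing the file's existing T alias; whether FP16/BF16 really is the intended contract is exactly the decision the comment asks for.

    def atomic_addx2_program(M, N, block_M, block_N, dtype=T.float16):
        # Fail fast with a clear message instead of surfacing a codegen error later.
        if dtype not in (T.float16, T.bfloat16):
            raise ValueError(f"atomic_addx2 expects float16/bfloat16, got {dtype}; "
                             "use atomic_addx4 for float32")
        ...  # existing @T.prim_func definition unchanged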
🤖 Prompt for AI Agents
In testing/python/language/test_tilelang_language_atomic_add.py around lines
198-200, the test and docstring disagree on supported dtypes: the docstring
states atomic_addx2 supports FP16/BF16 but the test calls it with T.float32; add
a runtime dtype check and align tests/docs by restricting atomic_addx2 to
half-precision only. Update atomic_addx2_program to assert dtype is either
T.float16 or T.bfloat16 (raise a clear error if not), update the docstring to
explicitly list only FP16/BF16, and change test_atomic_addx2_float() to either
use atomic_addx4 for float32 or remove/rename the float32 test to target the
correct quad-width primitive so implementation, tests, and docs remain
consistent.

Comment on lines +31 to 32
kernel = tilelang.compile(program, out_idx=[1], pass_configs={"tl.disable_warp_specialized": True, "tl.disable_tma_lower": True})
a = torch.randn(M, N, device="cuda", dtype=getattr(torch, dtype))

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Find the tilelang compile function to understand default target behavior
find . -name "*.py" -type f | head -20

Repository: tile-ai/tilelang

Length of output: 1083


🏁 Script executed:

# Search for the compile function definition
rg -n "def compile" --type=py | head -20

Repository: tile-ai/tilelang

Length of output: 863


🏁 Script executed:

# Look for TILELANG_TARGET environment variable usage
rg -n "TILELANG_TARGET" --type=py -C2

Repository: tile-ai/tilelang

Length of output: 4831


🏁 Script executed:

# Check if test gating decorators exist in the test file
rg -n "@.*cuda\|@.*hip\|@.*requires\|skipif\|pytest.mark" testing/python/language/test_tilelang_language_mask_op.py

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

# Search for test gating decorators in other language test files for comparison
rg -n "@.*cuda\|@.*hip\|@.*requires\|skipif\|pytest.mark" testing/python/language/ --type=py | head -30

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

# Get the compile function signature and relevant code
sed -n '47,120p' tilelang/jit/__init__.py

Repository: tile-ai/tilelang

Length of output: 3255


🏁 Script executed:

# Check the implementation of how target="auto" is handled
rg -n "auto.*target\|target.*auto" tilelang/jit/__init__.py -C3 | head -40

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

# Check the actual test file to see its structure and decorators
cat -n testing/python/language/test_tilelang_language_mask_op.py

Repository: tile-ai/tilelang

Length of output: 6381


🏁 Script executed:

# Search for how target resolution works - check for "auto" handling in backends
rg -n "def.*resolve.*target|target.*==.*auto|if.*auto" tilelang/ --type=py -C2 | head -50

Repository: tile-ai/tilelang

Length of output: 3204


🏁 Script executed:

# Look for test gating patterns in other Python test files
find testing/python -name "*.py" -type f | head -10

Repository: tile-ai/tilelang

Length of output: 580


🏁 Script executed:

# Search for pytest.mark.skipif or CUDA-specific decorators in test files
rg -n "pytest.mark.skipif|@.*cuda|@.*rocm|@.*requires_gpu" testing/python/ --type=py | head -20

Repository: tile-ai/tilelang

Length of output: 2016


🏁 Script executed:

# Check if there's a pattern for conditional test decorators
rg -n "def.*test_|class.*Test" testing/python/language/ --type=py -A2 | head -40

Repository: tile-ai/tilelang

Length of output: 2966


🏁 Script executed:

# Check if there's a tilelang.testing.requires_cuda decorator and how it's defined
rg -n "requires_cuda|requires_rocm" tilelang/testing/ --type=py | head -20

Repository: tile-ai/tilelang

Length of output: 1837


🏁 Script executed:

# Check other language test files to see if they have decorators
ls -la testing/python/language/test_*.py | head -10

Repository: tile-ai/tilelang

Length of output: 1135


🏁 Script executed:

# Check if any language tests have been gated
rg -n "@.*requires" testing/python/language/ --type=py

Repository: tile-ai/tilelang

Length of output: 5286


🏁 Script executed:

# Look at a language test file that's similar to understand the pattern
cat -n testing/python/language/test_tilelang_language_vectorized_cast.py | head -120

Repository: tile-ai/tilelang

Length of output: 4956


Add test gating decorators to CUDA-specific tests.

The test functions hardcode device="cuda" in torch tensor creation but lack @tilelang.testing.requires_cuda decorators. This will cause tests to fail on ROCm/HIP systems. Add the decorator to all four test functions (test_tilelang_copy_mask_parallel, test_tilelang_copy_mask_copy, test_tilelang_copy_mask_parallel_range, test_tilelang_copy_mask_copy_range) to match the pattern used throughout other language tests and properly gate them to CUDA-only execution.
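Concretely, the gate is one decorator per function, mirroring the other language tests; a sketch for one of the four (assuming the module imports tilelang.testing, and with the body elided):

    @tilelang.testing.requires_cuda
    def test_tilelang_copy_mask_parallel():
        ...  # existing body unchanged; repeat the decorator for the other three tests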

🤖 Prompt for AI Agents
In testing/python/language/test_tilelang_language_mask_op.py around lines 31-32,
the tests create tensors with device="cuda" but lack CUDA gating; add the
@tilelang.testing.requires_cuda decorator above each of the four test functions:
test_tilelang_copy_mask_parallel, test_tilelang_copy_mask_copy,
test_tilelang_copy_mask_parallel_range, and test_tilelang_copy_mask_copy_range
so these tests only run on CUDA-enabled environments and avoid failures on
ROCm/HIP systems.

Comment on lines +626 to 627
"float8_e5m2fnuz": "fp8_e5_t",
"float8_e4m3fnuz": "fp8_e4_t",

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the wrapper.py file to see the full context of the type maps
head -n 700 tilelang/jit/adapter/wrapper.py | tail -n 120

Repository: tile-ai/tilelang

Length of output: 4344


🏁 Script executed:

# Search for uint64 usage in the codebase
rg -n "uint64" --type=py -C 2

Repository: tile-ai/tilelang

Length of output: 30117


🏁 Script executed:

# Search for float8_e5m2fnuz and float8_e4m3fnuz to verify they are ROCm-specific
rg -n "float8_e5m2fnuz|float8_e4m3fnuz" --type=py -C 1

Repository: tile-ai/tilelang

Length of output: 22364


🏁 Script executed:

# Check if there are differences documented between CUDA and HIP type support
rg -n "TLCUDASourceWrapper|TLHIPSourceWrapper" --type=py -A 5 | head -n 100

Repository: tile-ai/tilelang

Length of output: 4074


🏁 Script executed:

# Get the complete TLCUDASourceWrapper._TYPE_MAP definition
sed -n '168,210p' tilelang/jit/adapter/wrapper.py

Repository: tile-ai/tilelang

Length of output: 1378


Add "uint64": "uint64_t" to the CUDA type map for consistency.

The HIP type map includes "uint64": "uint64_t", but the parent TLCUDASourceWrapper._TYPE_MAP does not have this mapping. Since uint64 is a valid TileLang type and CUDA kernels can use it, add this mapping to the CUDA type map to maintain consistency across backends. The CPU backend also includes this mapping.

Diff
        "float8_e5m2": "fp8_e5_t",
        "float64": "double",
        "int64": "int64_t",
        "int32": "int",
        "uint32": "unsigned int",
+       "uint64": "uint64_t",
        "bool": "int8_t",
🤖 Prompt for AI Agents
In tilelang/jit/adapter/wrapper.py around lines 626-627, the CUDA type map is
missing the "uint64": "uint64_t" entry which exists in the HIP and CPU maps; add
the mapping "uint64": "uint64_t" alongside the other integer mappings (e.g.,
next to "uint32": "unsigned int") so CUDA kernels can use TileLang uint64 types
and the backends remain consistent.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
testing/python/jit/test_tilelang_jit_tvm_ffi.py (1)

158-173: Several tests still lack CUDA gating decorators.

While decorators were added to test_tvm_ffi_l2_persistent_map and test_tvm_ffi_pdl, the following tests continue to use CUDA APIs without proper gating:

  • test_gemm_jit_kernel (line 158): calls run_gemm_jit_kernel which uses .cuda() on lines 137-138
  • test_tvm_ffi_kernel_do_bench (line 209): uses the TVM FFI backend which requires CUDA
  • test_tvm_ffi_kernel_multi_stream (line 251): explicitly uses torch.cuda.Stream() and .cuda() on lines 235-236, 242, 246
  • test_tvm_ffi_dynamic_shape (line 300): uses .cuda() on lines 285-286, 292

Add @tilelang.testing.requires_cuda decorator to these test functions to ensure they skip gracefully on non-CUDA systems.

Also applies to: 209-211, 251-253, 300-307

🧹 Nitpick comments (3)
testing/python/jit/test_tilelang_jit_cython.py (1)

8-9: Optional: Consider removing redundant decorator.

The @tilelang.testing.requires_cuda decorator on line 8 may be redundant, as @tilelang.testing.requires_cuda_compute_version(9, 0) on line 9 already includes requires_cuda marks internally (as shown in tilelang/testing/__init__.py). While this doesn't cause issues, removing the redundant decorator would simplify the code.

🔎 Proposed simplification
-@tilelang.testing.requires_cuda
 @tilelang.testing.requires_cuda_compute_version(9, 0)
 def test_cython_pdl():
testing/python/jit/test_tilelang_jit_nvrtc.py (2)

375-386: Refactor to use decorator pattern for consistency.

The test_nvrtc_im2col_tma_desc function uses manual check_hopper() + pytest.skip() (lines 378-381) to gate on compute capability 9.0, while test_nvrtc_pdl (lines 441-442) uses the @tilelang.testing.requires_cuda_compute_version(9, 0) decorator for the same purpose. This inconsistency makes the codebase harder to maintain.

🔎 Proposed refactor for consistency
 @tilelang.testing.requires_cuda
+@tilelang.testing.requires_cuda_compute_version(9, 0)
 def test_nvrtc_im2col_tma_desc():
     """Test im2col TMA descriptor with NVRTC backend."""
-    if not check_hopper():
-        import pytest
-
-        pytest.skip("Test requires Hopper GPU (compute capability 9.0)")
-
     # Small test case for im2col TMA descriptor
     run_nvrtc_im2col_tma_desc(

441-442: Optional: Consider removing redundant decorator.

The @tilelang.testing.requires_cuda decorator on line 441 may be redundant, as @tilelang.testing.requires_cuda_compute_version(9, 0) on line 442 already includes requires_cuda marks internally. While this doesn't cause issues, removing the redundant decorator would simplify the code.

🔎 Proposed simplification
-@tilelang.testing.requires_cuda
 @tilelang.testing.requires_cuda_compute_version(9, 0)
 def test_nvrtc_pdl():
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9ea2b55 and 413522f.

📒 Files selected for processing (3)
  • testing/python/jit/test_tilelang_jit_cython.py
  • testing/python/jit/test_tilelang_jit_nvrtc.py
  • testing/python/jit/test_tilelang_jit_tvm_ffi.py
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.

Applied to files:

  • testing/python/jit/test_tilelang_jit_tvm_ffi.py
  • testing/python/jit/test_tilelang_jit_cython.py
  • testing/python/jit/test_tilelang_jit_nvrtc.py
📚 Learning: 2025-12-24T17:20:32.819Z
Learnt from: clouds56
Repo: tile-ai/tilelang PR: 1527
File: tilelang/env.py:0-0
Timestamp: 2025-12-24T17:20:32.819Z
Learning: The nvidia-cuda-nvcc PyPI package installs to `nvidia/cu13/bin/` (for CUDA 13), `nvidia/cu12/bin/` (for CUDA 12), and `nvidia/cu11/bin/` (for CUDA 11) in the site-packages directory, not to `nvidia/cuda_nvcc/bin/`. These paths should be used when detecting CUDA installations from PyPI packages in tilelang/env.py.

Applied to files:

  • testing/python/jit/test_tilelang_jit_nvrtc.py
🧬 Code graph analysis (3)
testing/python/jit/test_tilelang_jit_tvm_ffi.py (3)
tilelang/language/v2/dtypes.py (2)
  • float32 (300-300)
  • float16 (299-299)
tilelang/language/symbolics.py (1)
  • dynamic (12-29)
tilelang/testing/__init__.py (1)
  • requires_cuda_compute_version (40-104)
testing/python/jit/test_tilelang_jit_cython.py (1)
tilelang/testing/__init__.py (1)
  • requires_cuda_compute_version (40-104)
testing/python/jit/test_tilelang_jit_nvrtc.py (1)
tilelang/testing/__init__.py (1)
  • requires_cuda_compute_version (40-104)
🔇 Additional comments (3)
testing/python/jit/test_tilelang_jit_tvm_ffi.py (2)

167-167: LGTM: Improved numerical stability with float32 accumulation.

Switching from float16 to float32 accumulation dtype across these tests is a good practice. Float32 accumulation provides better numerical stability and reduces accumulation errors in matrix multiplication operations while maintaining float16 for input/output tensors.

Also applies to: 210-210, 252-252, 301-301, 303-303, 306-306
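A quick way to see the effect being praised here is to accumulate the same float16 products with and without a wider accumulator (illustration only, independent of the kernels under test):

    import torch

    # Each addend (~1e-4) eventually falls below half a float16 ulp of the running
    # sum, so the float16 accumulator stalls; the float32 accumulator does not.
    prod = (torch.full((4096,), 0.01, dtype=torch.float16) *
            torch.full((4096,), 0.01, dtype=torch.float16))

    acc16 = torch.tensor(0.0, dtype=torch.float16)
    acc32 = torch.tensor(0.0, dtype=torch.float32)
    for x in prod:
        acc16 = acc16 + x          # every partial sum rounded back to float16
        acc32 = acc32 + x.float()  # partial sums kept in float32

    print(acc16.item(), acc32.item())  # float16 stalls near 0.25; float32 stays near the exact ~0.41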


386-386: LGTM: Appropriate CUDA gating decorators.

The decorators are correctly applied:

  • test_tvm_ffi_l2_persistent_map properly gates on CUDA availability since it uses .cuda() tensors.
  • test_tvm_ffi_pdl properly gates on both CUDA availability and Hopper architecture (9.0) since PDL (Programmatic Dependent Launch) is a Hopper-specific feature.

Also applies to: 446-447

testing/python/jit/test_tilelang_jit_nvrtc.py (1)

158-158: LGTM!

The @tilelang.testing.requires_cuda decorators correctly gate these tests to run only on CUDA-enabled environments. The pattern is consistent and appropriate for tests that require CUDA support but don't need specific compute capability versions.

Also applies to: 210-210, 253-253, 303-303, 389-389

@LeiWang1999 LeiWang1999 merged commit 4e53b50 into tile-ai:main Jan 5, 2026
6 checks passed