
Conversation

@silentCoder-dev silentCoder-dev commented Dec 30, 2025

Resolve #1501

Summary by CodeRabbit

  • New Features
    • Added support for cooperative-group (cooperative launch) execution on CUDA, enabling kernels to use grid-level synchronization for more advanced parallel patterns.
  • Tests
    • Added a CUDA-only test that verifies cooperative grid synchronization behavior and kernel output correctness on supported devices.


@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀


coderabbitai bot commented Dec 30, 2025

📝 Walkthrough

Adds detection of the "use_cooperative_groups" function attribute in CUDA runtime extraction and tags launch parameters with runtime::launch_param::kUseCooperativeLaunch. Also adds a Python test validating cooperative grid synchronization on CUDA.

Changes

  • CUDA runtime launch tagging — src/target/rt_mod_cuda.cc: check the "use_cooperative_groups" function attribute in ExtractFuncInfo and append runtime::launch_param::kUseCooperativeLaunch to info.launch_param_tags.
  • Cooperative-grid test — testing/python/language/test_tilelang_language_cooperative.py: new test in which a grid_sync(N=1024) kernel uses cooperative grid sync; test_grid_sync() asserts the cooperative sync call appears in the generated kernel, runs it on CUDA, and validates the output.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I wiggle my nose, then hop in a line,
Threads clap in rhythm, their barriers align,
A tiny tag whispers, "Launch all as one,"
Kernels dance together, work neatly done. ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve coverage.

✅ Passed checks (4 passed)

  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title check — ✅ Passed: the title accurately describes the main feature, adding kUseCooperativeLaunch tag support for the tvm_ffi execution path, which directly addresses the linked issue.
  • Linked Issues check — ✅ Passed: the PR addresses issue #1501 by adding cooperative launch support to the tvm_ffi/host codegen path through the kUseCooperativeLaunch tag and corresponding test coverage.
  • Out of Scope Changes check — ✅ Passed: all changes are in scope; the runtime tag addition directly fixes the cooperative launch failure in tvm_ffi, and the test module validates the fix without introducing unrelated modifications.
✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b6a2513 and c0fac3a.

📒 Files selected for processing (1)
  • src/target/rt_mod_cuda.cc
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)

Comment on lines +34 to +37

    if (f->HasNonzeroAttr("use_cooperative_groups")) {
      info.launch_param_tags.push_back(
          runtime::launch_param::kUseCooperativeLaunch);
    }

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

echo "=== Searching for kUseCooperativeLaunch definition ==="
rg -n 'kUseCooperativeLaunch' --type cpp --type c --type h

echo -e "\n=== Searching for cudaLaunchCooperativeKernel usage ==="
rg -n 'cudaLaunchCooperativeKernel' --type cpp --type c --type h

echo -e "\n=== Checking launch_param enum/constants definitions ==="
rg -n 'launch_param' --type cpp --type c --type h -A 3 | head -80

Repository: tile-ai/tilelang

Length of output: 5337


🏁 Script executed:

#!/bin/bash

echo "=== Searching for CUDAModuleCreate implementation ==="
rg -n 'CUDAModuleCreate' --type cpp --type c --type h -B 2 -A 15

echo -e "\n=== Searching for launch_param_tags processing ==="
rg -n 'launch_param_tags' --type cpp --type c --type h -B 2 -A 2 | head -100

Repository: tile-ai/tilelang

Length of output: 4241


🏁 Script executed:

#!/bin/bash

echo "=== Check ExtractFuncInfo function context ==="
rg -n 'ExtractFuncInfo' --type cpp --type c --type h -B 2 -A 20 | head -100

echo -e "\n=== Search for use_cooperative_groups attribute ==="
rg -n 'use_cooperative_groups' --type cpp --type c --type h

Repository: tile-ai/tilelang

Length of output: 6485


🏁 Script executed:

#!/bin/bash

echo "=== Searching for launch_param namespace/enum definitions ==="
rg -n 'namespace.*launch_param|enum.*launch_param' --type cpp --type c --type h

echo -e "\n=== Searching for kUseCooperativeLaunch in entire codebase ==="
rg -n 'kUseCooperativeLaunch' -B 3 -A 3

echo -e "\n=== Searching for runtime::CUDAModuleCreate definition ==="
rg -n 'CUDAModuleCreate.*\{' -A 10 --type cpp --type c

Repository: tile-ai/tilelang

Length of output: 834


🏁 Script executed:

#!/bin/bash

echo "=== Check if launch_param_tags is defined in FunctionInfo ==="
rg -n 'struct FunctionInfo|class FunctionInfo' --type cpp --type c --type h -A 15

echo -e "\n=== Search for where launch_param_tags are used in runtime ==="
rg -n 'launch_param_tags' --type cpp --type c --type h -B 2 -A 2

Repository: tile-ai/tilelang

Length of output: 2527


🏁 Script executed:

#!/bin/bash

echo "=== Searching for FunctionInfo struct with launch_param_tags ==="
rg -n 'launch_param_tags' -B 15 | head -60

echo -e "\n=== Search in runtime headers for FunctionInfo ==="
fd -e h -e hpp | xargs grep -l 'FunctionInfo' | head -10

echo -e "\n=== Search for runtime module implementation files ==="
fd -e cc -e cpp | xargs grep -l 'CUDAModuleCreate' 2>/dev/null | head -10

Repository: tile-ai/tilelang

Length of output: 4136


🏁 Script executed:

#!/bin/bash

echo "=== List all header files to locate runtime definitions ==="
find . -path '*/runtime/*' \( -name '*.h' -o -name '*.hpp' \) | head -20

echo -e "\n=== Search for launch_param in all files ==="
rg 'launch_param' --type cpp --type c --type h -l | head -10

Repository: tile-ai/tilelang

Length of output: 475


🏁 Script executed:

#!/bin/bash

echo "=== Search for FunctionInfo definition ==="
rg 'struct FunctionInfo' -B 2 -A 30 --type cpp --type c

echo -e "\n=== Search for runtime namespace in code ==="
rg 'namespace runtime' --type cpp --type c -B 2 -A 5 | head -50

echo -e "\n=== Check what runtime is being used ==="
rg '#include.*runtime' --type cpp --type c | head -20

Repository: tile-ai/tilelang

Length of output: 4983


🏁 Script executed:

#!/bin/bash

echo "=== Search for where launch_param constants might be defined ==="
rg 'kUse|launch_param::' --type cpp --type c --type h | head -30

echo -e "\n=== Check the exact context of kUseCooperativeLaunch ==="
sed -n '1,80p' src/target/rt_mod_cuda.cc

Repository: tile-ai/tilelang

Length of output: 3542


🏁 Script executed:

#!/bin/bash

echo "=== Search for runtime/cuda/cuda_module.h ==="
find . -name "cuda_module.h" -type f

echo -e "\n=== Search for runtime FunctionInfo definition ==="
rg 'struct FunctionInfo|class FunctionInfo' --type cpp --type c -B 5 -A 20

echo -e "\n=== Search for launch_param namespace definition ==="
rg 'namespace launch_param|const char.*kUse' --type cpp --type c --type h -B 2 -A 2

Repository: tile-ai/tilelang

Length of output: 2303


🏁 Script executed:

#!/bin/bash

echo "=== Check runtime directory structure ==="
find runtime -type f -name "*.h" 2>/dev/null | head -20

echo -e "\n=== Look for FunctionInfo in runtime files ==="
find . -path '*/runtime/*' -name '*.h' -o -path '*/runtime/*' -name '*.cc' 2>/dev/null | xargs grep -l 'FunctionInfo' 2>/dev/null

echo -e "\n=== Check if there is an external TVM dependency ==="
cat CMakeLists.txt 2>/dev/null | head -30 || echo "No CMakeLists found"

Repository: tile-ai/tilelang

Length of output: 1283


🏁 Script executed:

#!/bin/bash

echo "=== Check for TVM submodule ==="
ls -la | grep tvm
find . -name "*.h" -path "*/tvm/*" | head -5

echo -e "\n=== Look for where runtime::FunctionInfo comes from ==="
rg 'FunctionInfo' --type cpp --type c -B 3 | grep -E '(include|namespace|runtime)' | head -20

echo -e "\n=== Check what is imported from runtime ==="
cat src/target/rt_mod_cuda.cc | head -20

Repository: tile-ai/tilelang

Length of output: 1819


🏁 Script executed:

#!/bin/bash

echo "=== Search all includes in rt_mod_cuda.cc ==="
rg '#include' src/target/rt_mod_cuda.cc

echo -e "\n=== Check for runtime module implementations elsewhere ==="
find . -type d -name runtime

echo -e "\n=== Look for TVM in dependencies ==="
cat CMakeLists.txt | grep -i tvm | head -10

Repository: tile-ai/tilelang

Length of output: 945


🏁 Script executed:

#!/bin/bash

echo "=== Find cuda_module.h in local runtime ==="
find ./src/runtime -name "cuda_module.h" -o -name "*.h" | grep -i cuda

echo -e "\n=== Check runtime directory structure ==="
ls -la ./src/runtime/

echo -e "\n=== Search for FunctionInfo definition in src/runtime ==="
grep -r 'struct FunctionInfo\|class FunctionInfo' ./src/runtime/ 2>/dev/null

echo -e "\n=== Search for launch_param in src/runtime ==="
grep -r 'launch_param' ./src/runtime/ 2>/dev/null | head -20

Repository: tile-ai/tilelang

Length of output: 621


🏁 Script executed:

#!/bin/bash

echo "=== Check cuda_module.h file ==="
cat ./src/runtime/cuda/cuda_module.h 2>/dev/null | head -100

echo -e "\n=== Look for all header files in runtime/cuda ==="
ls -la ./src/runtime/cuda/ 2>/dev/null

Repository: tile-ai/tilelang

Length of output: 143


🏁 Script executed:

#!/bin/bash

echo "=== Check for TVM submodule or external dependency ==="
find . -path "./3rdparty/*" -o -path "./tvm/*" -o -path "./third_party/*" 2>/dev/null | head -10

echo -e "\n=== Look at CMakeLists to understand TVM dependency ==="
grep -A 10 'load_tvm.cmake\|find_package.*TVM' CMakeLists.txt

echo -e "\n=== Check if TVM is available as a system library ==="
pkg-config --cflags --libs tvm 2>/dev/null || echo "TVM not found via pkg-config"

echo -e "\n=== Search for where kUseCooperativeLaunch might be defined ==="
grep -r 'kUseCooperativeLaunch' . 2>/dev/null

Repository: tile-ai/tilelang

Length of output: 888


🏁 Script executed:

#!/bin/bash

echo "=== List the cmake load_tvm.cmake file ==="
cat cmake/load_tvm.cmake 2>/dev/null | head -50

echo -e "\n=== Look for TVM headers in the build directory ==="
find . -name "runtime.h" -path "*/tvm/*" 2>/dev/null | head -5

Repository: tile-ai/tilelang

Length of output: 1048


🏁 Script executed:

#!/bin/bash

echo "=== Search for kUseCooperativeLaunch in TVM submodule ==="
grep -r 'kUseCooperativeLaunch' ./3rdparty/tvm/ 2>/dev/null

echo -e "\n=== Search for launch_param definition in TVM ==="
grep -r 'namespace launch_param\|kUseCooperative' ./3rdparty/tvm/ 2>/dev/null | head -20

echo -e "\n=== Search for FunctionInfo struct in TVM ==="
grep -r 'struct FunctionInfo' ./3rdparty/tvm/src/runtime/ 2>/dev/null | head -20

Repository: tile-ai/tilelang

Length of output: 215


🏁 Script executed:

#!/bin/bash

echo "=== Search for cudaLaunchCooperativeKernel usage in TVM ==="
grep -r 'cudaLaunchCooperativeKernel' ./3rdparty/tvm/src/runtime/ 2>/dev/null

echo -e "\n=== Look for CUDAModuleCreate in TVM ==="
grep -r 'CUDAModuleCreate' ./3rdparty/tvm/src/runtime/ 2>/dev/null | head -10

echo -e "\n=== Search for launch_param in TVM runtime ==="
grep -r 'launch_param' ./3rdparty/tvm/include/ 2>/dev/null | head -20

Repository: tile-ai/tilelang

Length of output: 209


🏁 Script executed:

#!/bin/bash

echo "=== Check if code compiles by looking at build artifacts ==="
find . -name "*.o" -o -name "*.a" -o -name "*.so" 2>/dev/null | head -5

echo -e "\n=== Search for any definition of kUseCooperativeLaunch in entire repo ==="
grep -r 'kUseCooperativeLaunch' . --include="*.h" --include="*.cc" --include="*.cpp" 2>/dev/null

echo -e "\n=== Check git log for recent changes to rt_mod_cuda.cc ==="
git log --oneline src/target/rt_mod_cuda.cc 2>/dev/null | head -10

echo -e "\n=== Check git diff to see what changed ==="
git diff HEAD src/target/rt_mod_cuda.cc 2>/dev/null | head -100

Repository: tile-ai/tilelang

Length of output: 422


🏁 Script executed:

#!/bin/bash

echo "=== Look for all constants defined in launch_param-like patterns ==="
grep -r 'const.*char.*kUse\|const.*string.*kUse' ./3rdparty/tvm/ 2>/dev/null | grep -i launch

echo -e "\n=== Check if launch_param is defined elsewhere ==="
find ./3rdparty/tvm -name "*.h" -exec grep -l 'launch_param' {} \; 2>/dev/null

echo -e "\n=== Look at what the runtime header actually contains ==="
head -200 ./3rdparty/tvm/include/tvm/runtime/c_runtime_api.h 2>/dev/null | grep -A 50 'launch'

Repository: tile-ai/tilelang

Length of output: 239


🏁 Script executed:

#!/bin/bash

echo "=== Check the commit that added this ==="
git show c0fac3a 2>/dev/null

echo -e "\n=== Search for where constants should be defined ==="
grep -r 'kUseDynamicSharedMemoryTag' ./3rdparty/tvm/ 2>/dev/null | head -5

echo -e "\n=== Check if it's in a TVM include file ==="
find ./3rdparty/tvm/include -name "*.h" -exec grep -l 'launch_param\|FunctionInfo' {} \; 2>/dev/null

Repository: tile-ai/tilelang

Length of output: 50376


🏁 Script executed:

#!/bin/bash

echo "=== Look for TVM runtime module includes ==="
grep -r 'include.*tvm/runtime' ./3rdparty/tvm/src/runtime/cuda/ 2>/dev/null | head -10

echo -e "\n=== Check the actual TVM cuda module implementation ==="
find ./3rdparty/tvm -name "*cuda_module*" -type f

echo -e "\n=== Look at includes in tilelang's rt_mod_cuda.cc more carefully ==="
head -10 ./src/target/rt_mod_cuda.cc

Repository: tile-ai/tilelang

Length of output: 497


The referenced constant runtime::launch_param::kUseCooperativeLaunch does not exist in this codebase or the TVM dependency, preventing compilation.

The code at lines 34-37 correctly checks for the "use_cooperative_groups" attribute and attempts to append a launch parameter tag, but it references runtime::launch_param::kUseCooperativeLaunch, which is not defined anywhere. While a similar pattern exists with tvm::runtime::launch_param::kUseDynamicSharedMemoryTag in lower_device_kernel_launch.cc, the specific cooperative launch constant is missing from the TVM runtime headers. This will cause a linker/compilation error.

Define or import the missing constant from the TVM runtime, or verify that it exists in the TVM version being used. Additionally, confirm that the CUDA runtime module downstream actually uses this tag to invoke cudaLaunchCooperativeKernel.
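For context, launch-parameter tags in TVM-style runtimes are plain strings that the module wrapper inspects when binding a kernel for launch. A minimal Python emulation of that dispatch decision (the tag string value, `choose_launcher`, and the returned driver-call names are illustrative assumptions, not the actual TVM implementation):

```python
# Hypothetical tag value; the real constant lives (or should live) in the
# runtime::launch_param namespace on the C++ side.
K_USE_COOPERATIVE_LAUNCH = "tilelang.use_cooperative_launch"

def choose_launcher(launch_param_tags):
    """Pick the CUDA driver entry point implied by a kernel's launch tags."""
    if K_USE_COOPERATIVE_LAUNCH in launch_param_tags:
        # Cooperative launch is required for grid-wide sync to be legal.
        return "cudaLaunchCooperativeKernel"
    return "cudaLaunchKernel"

# A kernel tagged the way ExtractFuncInfo does in this PR:
coop_tags = ["blockIdx.x", "threadIdx.x", K_USE_COOPERATIVE_LAUNCH]
plain_tags = ["blockIdx.x", "threadIdx.x"]
print(choose_launcher(coop_tags))   # cudaLaunchCooperativeKernel
print(choose_launcher(plain_tags))  # cudaLaunchKernel
```

The point of the pattern is that the tag is pure metadata: if the constant is never defined (or never consumed downstream), the kernel silently falls through to the ordinary launch path and grid sync deadlocks or faults at runtime.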

🤖 Prompt for AI Agents

In src/target/rt_mod_cuda.cc around lines 34-37: the code references runtime::launch_param::kUseCooperativeLaunch, which does not exist and prevents compilation. Fix by either:
  (A) replacing that identifier with the correct existing TVM runtime constant (confirm the exact name in your TVM headers and update the code/import to use it), or
  (B) if your TVM version lacks such a tag, adding a single definition for the missing constant in the appropriate runtime::launch_param namespace/header used by this project (matching the type/value convention of the other tags), and then ensuring the CUDA runtime module checks that tag and calls cudaLaunchCooperativeKernel where required.
Also verify and document the TVM version compatibility so future builds use the correct symbol.

LeiWang1999 previously approved these changes Dec 30, 2025
@LeiWang1999 (Member)

Can we add a simple test for it?

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
testing/python/language/test_tilelang_language_cooperative.py (1)

32-32: Consider using torch.zeros or torch.empty for clarity.

Since the kernel overwrites all values in the first loop, the initial random values are never used. Using torch.zeros or torch.empty would make this clearer and avoid confusion.

🔎 Suggested refactor
-    tensor = torch.rand((N), dtype=torch.float32, device="cuda")
+    tensor = torch.zeros((N), dtype=torch.float32, device="cuda")
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c0fac3a and 1aacd58.

📒 Files selected for processing (1)
  • testing/python/language/test_tilelang_language_cooperative.py
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.

Applied to files:

  • testing/python/language/test_tilelang_language_cooperative.py
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.

Applied to files:

  • testing/python/language/test_tilelang_language_cooperative.py
🧬 Code graph analysis (1)
testing/python/language/test_tilelang_language_cooperative.py (2)
tilelang/language/builtin.py (1)
  • sync_grid (721-723)
tilelang/testing/__init__.py (1)
  • requires_cuda_compute_version_ge (107-108)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
🔇 Additional comments (5)
testing/python/language/test_tilelang_language_cooperative.py (5)

1-4: LGTM: Imports are appropriate.

The imports correctly include all necessary modules for the cooperative grid synchronization test.


7-23: LGTM: Kernel logic correctly validates cooperative synchronization.

The kernel design is clever:

  • First phase writes A[i] = i
  • Grid synchronization ensures all blocks complete phase 1
  • Second phase computes A[i] += A[N-i-1], which reads values written by other blocks
  • Expected result: all elements become N-1 (1023)

This effectively validates that cooperative grid sync works correctly across all blocks.
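The expected values can be checked on the host with a small pure-Python emulation; the grid barrier is implicit here because the two phases run sequentially, so this validates only the arithmetic, not the synchronization itself:

```python
N = 1024
A = [0.0] * N

# Phase 1: each "block" writes its own index.
for i in range(N):
    A[i] = float(i)

# On the device a grid-wide sync sits here: every phase-1 write must be
# visible to all blocks before phase 2 reads across block boundaries.

# Phase 2: each element adds its mirror, which a different block wrote.
# Computed out-of-place: a sequential in-place update would read mirror
# values already overwritten earlier in the same pass.
A = [a + b for a, b in zip(A, reversed(A))]

# Every element becomes i + (N - 1 - i) = N - 1.
assert all(x == N - 1 for x in A)
print(A[0], A[-1])  # 1023.0 1023.0
```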


26-28: LGTM: Test requirements are correctly specified.

The decorators appropriately require CUDA and compute capability ≥6.0, which is the minimum for cooperative group support (Pascal architecture and later).
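That cutoff is easy to encode as a plain predicate over a (major, minor) compute-capability tuple (the shape `torch.cuda.get_device_capability` returns); a minimal sketch, with `supports_cooperative_launch` being a hypothetical helper name:

```python
# Pascal (sm_60) is the first architecture supporting cooperative groups
# with grid-wide synchronization.
MIN_COOPERATIVE_CC = (6, 0)

def supports_cooperative_launch(capability):
    """Check a (major, minor) compute capability against the Pascal cutoff."""
    return tuple(capability) >= MIN_COOPERATIVE_CC

print(supports_cooperative_launch((8, 0)))  # Ampere  -> True
print(supports_cooperative_launch((5, 2)))  # Maxwell -> False
```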


38-39: LGTM: Standard test entry point.

The main block correctly uses tilelang.testing.main() for test execution.


26-35: Test adequately validates cooperative launch functionality.

This test effectively addresses the PR objectives by:

  1. Verifying the kernel source includes the cooperative_groups sync call (line 31)
  2. Ensuring the kernel executes without the runtime failure described in issue #1501 (line 33)
  3. Validating the synchronization semantics are correct through output verification (line 35)

If the kUseCooperativeLaunch tag were not properly set, the kernel would fail with the CUDA launch error reported in the issue, and this test would catch that regression.

@LeiWang1999 LeiWang1999 merged commit e1138ad into tile-ai:main Dec 31, 2025
6 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Dec 31, 2025

Successfully merging this pull request may close these issues.

  • [BUG] Unspecified launch failure when running persistent kernel