[Doc] Optimize the quickstart guide for clarity and not just for CUDA #858
Conversation
Refactor matmul example to include ReLU activation and update batch size in benchmark script
Caution: Review failed. The pull request is closed.

Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

README and example updated: inner kernel renamed to matmul_relu_kernel, a ReLU applied after GEMM, JIT wrapper changed to @tilelang.jit(target="cuda"), matmul now returns the renamed kernel, and profiling/usage paths adjusted accordingly.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant U as User
    participant J as matmul (JIT wrapper)
    participant K as matmul_relu_kernel
    participant G as CUDA GPU
    participant P as Profiler
    participant T as PyTorch
    U->>J: call matmul(M,N,K,dtype)
    J-->>U: returns K (matmul_relu_kernel)
    U->>K: launch(A,B,C)
    K->>G: execute kernel
    rect rgba(220,235,255,0.35)
    note over G: Tile load & GEMM
    G-->>G: accumulate into C_local
    G-->>G: apply ReLU to C_local
    G-->>G: store C_local -> C
    end
    G-->>U: C populated
    U->>T: compute ref = relu(A @ B)
    T-->>U: ref tensor
    U-->>U: compare C vs ref
    U->>K: K.get_profiler(tensor_supply=...)
    K-->>P: profiler
    U->>P: run profiling
    P-->>U: latency metrics
```
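For readers skimming the diagram, a rough host-side sketch of the same flow follows (minus the profiling step, which is covered further down). The import path, problem sizes, and tolerances are illustrative assumptions; only the kernel name and the `relu(A @ B)` reference come from this PR.

```python
import torch

# Hypothetical import path; in the PR the wrapper lives in examples/quickstart.py.
from examples.quickstart import matmul

# Illustrative problem and tile sizes (not part of the diff).
M = N = K = 1024
block_M, block_N, block_K = 128, 128, 32

# 1. The JIT wrapper returns the renamed kernel object.
matmul_relu_kernel = matmul(M, N, K, block_M, block_N, block_K)

# 2. Launch on device tensors (CUDA here; the review below suggests a portable variant).
a = torch.randn(M, K, device="cuda", dtype=torch.float16)
b = torch.randn(K, N, device="cuda", dtype=torch.float16)
c = torch.empty(M, N, device="cuda", dtype=torch.float16)
matmul_relu_kernel(a, b, c)

# 3. Verify against PyTorch: ref = relu(A @ B).
ref = torch.relu(a @ b)
torch.testing.assert_close(c, ref, rtol=1e-2, atol=1e-2)
```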
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)

📜 Recent review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run the project's formatting checks before review. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
Summary of Changes

Hello @LeiWang1999, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request refines the quickstart guide and its accompanying example code to improve user comprehension and highlight the framework's broader applicability. By removing CUDA-specific optimization details and introducing a more streamlined example built around the @tilelang.jit decorator, the guide becomes clearer and less tied to a single backend.
Code Review
This pull request does a great job of optimizing the quickstart guide for clarity and making it more platform-agnostic. The changes in README.md and examples/quickstart.py simplify the code by removing CUDA-specific details and showcasing a more streamlined API with the @tilelang.jit decorator. The addition of the ReLU operation makes the example more practical. I've added a couple of minor suggestions for the README.md file to fix some inconsistencies in the comments and code examples to ensure they are correct and easy for users to follow.
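To ground the review comments that follow, here is a minimal sketch of the decorator-plus-inner-kernel pattern being discussed. Only the @tilelang.jit wrapper, the matmul_relu_kernel name, and the T.Parallel ReLU loop come directly from this PR; the tiled GEMM body (shared-memory copies, pipelining, block sizes) follows the usual TileLang quickstart shape and is an assumption here, and exact primitive spellings may differ between versions.

```python
import tilelang
import tilelang.language as T

@tilelang.jit  # target left unspecified so it can be inferred at compile time
def matmul(M, N, K, block_M, block_N, block_K, dtype="float16", accum_dtype="float"):

    @T.prim_func
    def matmul_relu_kernel(
            A: T.Tensor((M, K), dtype),
            B: T.Tensor((K, N), dtype),
            C: T.Tensor((M, N), dtype),
    ):
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_K), dtype)
            B_shared = T.alloc_shared((block_K, block_N), dtype)
            C_local = T.alloc_fragment((block_M, block_N), accum_dtype)

            T.clear(C_local)
            for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
                T.copy(A[by * block_M, ko * block_K], A_shared)
                T.copy(B[ko * block_K, bx * block_N], B_shared)
                T.gemm(A_shared, B_shared, C_local)

            # Fused ReLU on the accumulator before the final store
            for i, j in T.Parallel(block_M, block_N):
                C_local[i, j] = T.max(C_local[i, j], 0)

            T.copy(C_local, C[by * block_M, bx * block_N])

    # Return the named prim_func so callers get a clearly named kernel object
    return matmul_relu_kernel
```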
Review comment context:

```python
# 1. Define the kernel (matmul) and compile/lower it into an executable module
matmul_relu_kernel = matmul(M, N, K, block_M, block_N, block_K)

# 3. Test the kernel in Python with PyTorch data
```
Review comment context:

```python
# 4. Retrieve and inspect the generated CUDA source (optional)
cuda_source = jit_kernel.get_kernel_source()
print("Generated CUDA kernel:\n", cuda_source)
# cuda_source = jit_kernel.get_kernel_source()
```
The variable jit_kernel is no longer used in this example; it has been replaced by matmul_relu_kernel. This commented-out line should be updated to use the correct variable so that it works correctly if a user decides to uncomment it.
Suggested change:

```diff
-# cuda_source = jit_kernel.get_kernel_source()
+# cuda_source = matmul_relu_kernel.get_kernel_source()
```
Actionable comments posted: 0
🧹 Nitpick comments (9)
examples/quickstart.py (3)
40-42: Use a float literal for ReLU clamp to avoid implicit int→float cast. Replace `0` with `0.0` for type clarity across dtypes.

```diff
-            for i, j in T.Parallel(block_M, block_N):
-                C_local[i, j] = T.max(C_local[i, j], 0)
+            for i, j in T.Parallel(block_M, block_N):
+                C_local[i, j] = T.max(C_local[i, j], 0.0)
```
63-67: Generalize device selection so the example isn't CUDA-only. Pick the device dynamically; keeps the quickstart consistent with the PR's goal.

```diff
-a = torch.randn(M, K, device="cuda", dtype=torch.float16)
-b = torch.randn(K, N, device="cuda", dtype=torch.float16)
-c = torch.empty(M, N, device="cuda", dtype=torch.float16)
+dev = "cuda" if torch.cuda.is_available() else "cpu"
+a = torch.randn(M, K, device=dev, dtype=torch.float16)
+b = torch.randn(K, N, device=dev, dtype=torch.float16)
+c = torch.empty(M, N, device=dev, dtype=torch.float16)
```

Note: If CPU support is intended, torch.float16 matmul may not be available on CPU; consider switching both the kernel `dtype` and the torch tensors to float32 in that case.
68-70: Fix comment: this call executes the kernel, not the profiler.

```diff
-# Run the kernel through the Profiler
+# Execute the kernel
```

README.md (6)
127-131: Keep auto-target, but show explicit target selection as a tip. Small copy tweak improves clarity for users targeting HIP/CPU explicitly.

```diff
-# @tilelang.jit(target="cuda")
-# target currently can be "cuda" or "hip" or "cpu".
-# if not specified, it will be inferred from the input tensors during compile time
+## Tip: explicit target selection if needed
+# @tilelang.jit(target="cuda")  # or target="hip" or "cpu"
+# If not specified, target is inferred from input tensors at compile time.
```
162-166: Use `0.0` in ReLU to avoid int→float cast and improve readability.

```diff
-    for i, j in T.Parallel(block_M, block_N):
-        C_local[i, j] = T.max(C_local[i, j], 0)
+    for i, j in T.Parallel(block_M, block_N):
+        C_local[i, j] = T.max(C_local[i, j], 0.0)
```
187-190: Make device selection portable; avoid CUDA-only tensor creation. Keeps the quickstart aligned with "not just for CUDA."

```diff
-a = torch.randn(M, K, device="cuda", dtype=torch.float16)
-b = torch.randn(K, N, device="cuda", dtype=torch.float16)
-c = torch.empty(M, N, device="cuda", dtype=torch.float16)
+dev = "cuda" if torch.cuda.is_available() else "cpu"
+a = torch.randn(M, K, device=dev, dtype=torch.float16)
+b = torch.randn(K, N, device=dev, dtype=torch.float16)
+c = torch.empty(M, N, device=dev, dtype=torch.float16)
```

If CPU is in scope, you may need to use float32 on CPU and pass `dtype="float32"` into `matmul(...)`. Want a short snippet for that? A possible shape is sketched below.
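One way that snippet could look, assuming `matmul` accepts a `dtype` keyword as the surrounding example suggests; the sizes and variable names here are illustrative, not taken from the diff.

```python
import torch

# Illustrative sizes; matmul is the quickstart JIT wrapper.
M = N = K = 1024
block_M, block_N, block_K = 128, 128, 32

dev = "cuda" if torch.cuda.is_available() else "cpu"
kernel_dtype = "float16" if dev == "cuda" else "float32"   # assumed dtype kwarg
torch_dtype = torch.float16 if dev == "cuda" else torch.float32

matmul_relu_kernel = matmul(M, N, K, block_M, block_N, block_K, dtype=kernel_dtype)

a = torch.randn(M, K, device=dev, dtype=torch_dtype)
b = torch.randn(K, N, device=dev, dtype=torch_dtype)
c = torch.empty(M, N, device=dev, dtype=torch_dtype)
```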
192-193: Comment nit: this line runs the kernel, not the profiler.

```diff
-# Run the kernel through the Profiler
+# Execute the kernel
```
206-208: Minor copyedit for section title.

```diff
-# 5.Profile latency with kernel
+# 5. Profile kernel latency
```
203-205: Update README snippet: replace stale `jit_kernel` variable and CUDA-specific wording. The commented snippet at README.md:203-204 still uses `jit_kernel`/"CUDA"; update it to use the returned kernel variable and neutral wording.

```diff
-# 4. Retrieve and inspect the generated CUDA source (optional)
-# cuda_source = jit_kernel.get_kernel_source()
-# print("Generated CUDA kernel:\n", cuda_source)
+# 4. Retrieve and inspect the generated device-specific source (optional)
+# src = matmul_relu_kernel.get_kernel_source()
+# print("Generated kernel source:\n", src)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
README.md (2 hunks)
examples/quickstart.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
examples/quickstart.py (3)
tilelang/jit/__init__.py (1)
  jit (237-310)
tilelang/jit/kernel.py (1)
  get_profiler (360-376)
tilelang/utils/tensor.py (1)
  TensorSupplyType (11-18)
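Given the get_profiler and TensorSupplyType entries above, the profiling step presumably looks roughly like the sketch below; the Normal supply value and the do_bench() call are assumptions based on typical TileLang profiler usage, not taken from this diff.

```python
from tilelang.utils.tensor import TensorSupplyType

# matmul_relu_kernel is the compiled kernel returned by the JIT wrapper.
profiler = matmul_relu_kernel.get_profiler(tensor_supply=TensorSupplyType.Normal)

# Benchmark the kernel and report latency (milliseconds).
latency = profiler.do_bench()
print(f"Kernel latency: {latency:.3f} ms")
```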
🔇 Additional comments (4)
examples/quickstart.py (3)
4-7: Auto-target JIT aligns with "not just CUDA." Good change to let the backend be inferred from inputs; keeps the example portable.

11-15: Kernel entry rename improves clarity. Returning a clearly named prim_func is nicer than a generic `main`.

47-47: Returning the prim_func object is the right UX. Enables direct `.get_profiler(...)` and invocation without extra wrapping.

README.md (1)
134-138: Kernel rename/readability LGTM.
[Doc] Optimize the quickstart guide for clarity and not just for CUDA (tile-ai#858)
* Refactor matmul example to include ReLU activation and update batch size in benchmark script
* lint fix
as title.
Summary by CodeRabbit
New Features
Refactor
Documentation