
Conversation


@LeiWang1999 LeiWang1999 commented Jan 14, 2026

Summary

  • Replace local `do_bench` function definitions with the centralized `from tilelang.profiler import do_bench` import
  • Replace `from triton.testing import do_bench` with `from tilelang.profiler import do_bench` in the examples (a minimal sketch of the change is shown below)
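
A minimal before/after sketch of the intended call-site pattern; `kernel`, `a`, `b` and the warmup/rep values below are placeholders, not code taken from the examples:

# Before: each example carried its own helper or imported Triton's version.
# from triton.testing import do_bench

# After: use the shared helper from tilelang.profiler.
from tilelang.profiler import do_bench

# The centralized do_bench takes a zero-argument callable, so call sites
# wrap the kernel and its inputs in a lambda.
latency_ms = do_bench(lambda: kernel(a, b), warmup=25, rep=100)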

Changed Files

Import replacement (triton → tilelang.profiler)

  • examples/gemm_sp/example_gemm_sp.py
  • examples/gemm_sp/example_custom_compress.py

Remove local do_bench definition

  • examples/flash_decoding/example_gqa_decode_varlen_logits.py
  • examples/gdn/example_chunk_delta_h.py
  • examples/gdn/example_chunk_o_bwd.py
  • examples/gdn/example_chunk_delta_bwd.py

Test plan

  • Verify examples still run correctly with the centralized do_bench

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Refactor

    • Consolidated duplicated benchmarking helpers into a single profiler module import across examples.
    • Simplified the profiler implementation for clearer, more maintainable benchmarking behavior.
  • Breaking Changes

    • Removed TVM backend support from the profiler; benchmarking now uses a unified adapter-based approach.


refactor: replace local do_bench function with centralized profiler import

- Removed local implementations of the `do_bench` function from multiple example files.
- Updated imports to use the centralized `do_bench` function from `tilelang.profiler`, promoting code reuse and consistency across examples.
refactor: remove unused do_bench import from example_chunk_o_bwd.py and __init__.py

- Eliminated the unused `do_bench` import from `example_chunk_o_bwd.py` and `tilelang/profiler/__init__.py`, streamlining the code and improving clarity.
@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀


coderabbitai bot commented Jan 14, 2026

📝 Walkthrough

Walkthrough

The PR centralizes benchmarking by removing the local do_bench implementations in the examples and importing do_bench from tilelang.profiler instead. The profiler itself is simplified: TVM-specific detection and code paths are removed, and benchmarking is unified through an adapter-driven do_bench.

Changes

  • Flash Decoding Examples (examples/flash_decoding/example_gqa_decode_varlen_logits.py)
    Removed the local do_bench (≈25 lines), added from tilelang.profiler import do_bench, and adjusted calls to use the imported helper.
  • GDN Examples (examples/gdn/example_chunk_delta_bwd.py, examples/gdn/example_chunk_delta_h.py, examples/gdn/example_chunk_o_bwd.py)
    Deleted the local do_bench definitions and imported do_bench from tilelang.profiler.
  • GEMM Sparse Examples (examples/gemm_sp/example_custom_compress.py, examples/gemm_sp/example_gemm_sp.py)
    Switched the do_bench import source from triton.testing to tilelang.profiler; no other logic changes.
  • Profiler Module (tilelang/profiler/__init__.py)
    Removed determine_profiler, eliminated TVM-specific branches and the related input preparation and timing, simplified do_bench to use the adapter-provided function, and added an assertion that an adapter exists for func.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped through files, one by one,
Collected benchmarks until the pruning was done.
Adapters now hum a single tune,
Less duplication, more room to swoon. 🥕✨

🚥 Pre-merge checks: 2 passed, 1 failed

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 33.33%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title clearly and specifically describes the main refactoring change: centralizing do_bench imports from tilelang.profiler across multiple files.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@examples/flash_decoding/example_gqa_decode_varlen_logits.py`:
- Line 8: do_bench is being called with positional arguments that it doesn't
accept; wrap the benchmark target and its positional args using
functools.partial (or a lambda) so do_bench receives a single callable and only
keyword args itself. Locate the do_bench calls that pass
flash_attn_with_attn_pool_decode_tilelang (and similarly any other flash_attn*
benchmarks) and change them to
do_bench(functools.partial(flash_attn_with_attn_pool_decode_tilelang, q_decode,
k_varlen, v_varlen, cu_seqlens_k, max_seqlen_k, args.k_seqlen, 1, softmax_scale,
sink, block_size, False, tl_kernel), warmup=..., rep=..., ...) or equivalent,
ensuring you import functools.partial and preserve the existing do_bench keyword
parameters.

In `@examples/gdn/example_chunk_delta_bwd.py`:
- Line 7: The do_bench import now points to the centralized profiler which
expects a callable plus timing params, but the current calls pass tensor args
directly (see do_bench calls with chunk_gated_delta_rule_bwd_dhu and kernel and
tensors Q, K, W, G, h0, dht, dO, dv, scale, chunk_size), causing the tensors to
be parsed as timing parameters; fix by wrapping the target function and its
arguments into a zero-arg callable (e.g., use a lambda or functools.partial) so
do_bench receives a single callable and pass chunk_size as a keyword inside that
wrapper or via partial, e.g. wrap chunk_gated_delta_rule_bwd_dhu with its
tensors and chunk_size and similarly wrap kernel before calling do_bench.

In `@examples/gdn/example_chunk_delta_h.py`:
- Line 7: The calls to do_bench must be adapted to the centralized signature
that invokes fn() with no args: wrap the target functions
(chunk_gated_delta_rule_fwd_h and kernel) into zero-argument callables (e.g.,
lambda or functools.partial) that capture K, W, U, G, initial_state and any
other inputs, and then call do_bench with explicit benchmarking parameters
(warmup, rep, _n_warmup, _n_repeat, quantiles, fast_flush, backend, return_mode)
rather than passing tensors as positional/keyword args; update the two sites
where do_bench is invoked so they pass a zero-arg wrapper and appropriate
numeric/flag values for the benchmark options.
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4084dcd and 0c44cab.

📒 Files selected for processing (7)
  • examples/flash_decoding/example_gqa_decode_varlen_logits.py
  • examples/gdn/example_chunk_delta_bwd.py
  • examples/gdn/example_chunk_delta_h.py
  • examples/gdn/example_chunk_o_bwd.py
  • examples/gemm_sp/example_custom_compress.py
  • examples/gemm_sp/example_gemm_sp.py
  • tilelang/profiler/__init__.py
💤 Files with no reviewable changes (1)
  • examples/gdn/example_chunk_o_bwd.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.

Applied to files:

  • examples/gemm_sp/example_custom_compress.py
  • examples/gemm_sp/example_gemm_sp.py
🧬 Code graph analysis (5)
examples/gdn/example_chunk_delta_bwd.py (1)
tilelang/profiler/__init__.py (1)
  • do_bench (193-233)
examples/flash_decoding/example_gqa_decode_varlen_logits.py (1)
tilelang/profiler/__init__.py (1)
  • do_bench (193-233)
examples/gemm_sp/example_custom_compress.py (1)
tilelang/profiler/__init__.py (1)
  • do_bench (193-233)
examples/gemm_sp/example_gemm_sp.py (1)
tilelang/profiler/__init__.py (1)
  • do_bench (193-233)
tilelang/profiler/__init__.py (1)
tilelang/profiler/bench.py (1)
  • do_bench (64-135)
🔇 Additional comments (3)
tilelang/profiler/__init__.py (1)

219-233: LGTM! Clean simplification of the benchmarking logic.

The refactored do_bench method correctly delegates to the centralized do_bench function from tilelang.profiler.bench. The parameter mapping is correct, and the assertion ensures an adapter is provided when no explicit function is passed.
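
For reference, a rough sketch of the delegation pattern described here, written from the review summary alone; the attribute name `self.adapter`, the fallback logic, and the default warmup/rep values are assumptions, not the actual tilelang source:

from tilelang.profiler.bench import do_bench as _do_bench

class Profiler:
    # Sketch only: time through the centralized helper instead of keeping
    # TVM-specific timing paths in this class.
    def do_bench(self, func=None, warmup=25, rep=100, **kwargs):
        if func is None:
            # The review notes an assertion that an adapter exists when no
            # explicit function is passed.
            assert self.adapter is not None, "an adapter is required when func is None"
            func = self.adapter
        return _do_bench(func, warmup=warmup, rep=rep, **kwargs)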

examples/gemm_sp/example_gemm_sp.py (1)

9-9: LGTM! Import updated correctly.

The import change from triton.testing to tilelang.profiler aligns with the PR objective. The usage at lines 115-116 correctly wraps kernel calls in lambdas, making them compatible with the centralized do_bench signature.

examples/gemm_sp/example_custom_compress.py (1)

10-10: LGTM! Import updated correctly.

The import change from triton.testing to tilelang.profiler aligns with the PR objective. The usage at lines 314-315 correctly wraps kernel calls in lambdas, compatible with the centralized do_bench signature.


import argparse
import tilelang
import tilelang.language as T
from tilelang.profiler import do_bench

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check the signature of do_bench in tilelang/profiler

echo "=== Looking for do_bench function definition ==="
rg -n "def do_bench" --type py -A 20

echo ""
echo "=== Search for tilelang/profiler module ==="
find . -name "profiler.py" -o -name "profiler" -type d | head -20

Repository: tile-ai/tilelang

Length of output: 3149


🏁 Script executed:

#!/bin/bash
# View the do_bench calls in the file

echo "=== Calls to do_bench around lines 685-699 ==="
sed -n '680,700p' examples/flash_decoding/example_gqa_decode_varlen_logits.py

echo ""
echo "=== Calls to do_bench around lines 704-716 ==="
sed -n '700,720p' examples/flash_decoding/example_gqa_decode_varlen_logits.py

Repository: tile-ai/tilelang

Length of output: 1305


Critical: do_bench calls will fail due to signature incompatibility.

The imported tilelang.profiler.do_bench function signature only accepts fn and keyword arguments (warmup, rep, _n_warmup, _n_repeat, quantiles, fast_flush, backend, return_mode). It does not support the *args calling convention used in this file.

The calls at lines 687-700 and 709-720 pass 11-12 positional arguments after the function:

do_bench(flash_attn_with_attn_pool_decode_tilelang, q_decode, k_varlen, v_varlen, cu_seqlens_k, max_seqlen_k, args.k_seqlen, 1, softmax_scale, sink, block_size, False, tl_kernel)

This will raise a TypeError at runtime because of the unexpected positional arguments. Wrap each function call with functools.partial to bind the arguments, or refactor to use keyword arguments if the benchmarked function supports them.
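
Concretely, the first call site would become something like the sketch below; the argument list is taken from the snippet above, while the warmup/rep values are placeholders rather than the values used in the example:

import functools

from tilelang.profiler import do_bench

# Bind all positional inputs to the benchmark target so do_bench receives a
# zero-argument callable plus only its own keyword options.
latency = do_bench(
    functools.partial(
        flash_attn_with_attn_pool_decode_tilelang,
        q_decode, k_varlen, v_varlen, cu_seqlens_k, max_seqlen_k,
        args.k_seqlen, 1, softmax_scale, sink, block_size, False, tl_kernel,
    ),
    warmup=25,  # placeholder values; keep whatever the example already used
    rep=100,
)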

🤖 Prompt for AI Agents
In `@examples/flash_decoding/example_gqa_decode_varlen_logits.py` at line 8,
do_bench is being called with positional arguments that it doesn't accept; wrap
the benchmark target and its positional args using functools.partial (or a
lambda) so do_bench receives a single callable and only keyword args itself.
Locate the do_bench calls that pass flash_attn_with_attn_pool_decode_tilelang
(and similarly any other flash_attn* benchmarks) and change them to
do_bench(functools.partial(flash_attn_with_attn_pool_decode_tilelang, q_decode,
k_varlen, v_varlen, cu_seqlens_k, max_seqlen_k, args.k_seqlen, 1, softmax_scale,
sink, block_size, False, tl_kernel), warmup=..., rep=..., ...) or equivalent,
ensuring you import functools.partial and preserve the existing do_bench keyword
parameters.


import tilelang
import tilelang.language as T
from tilelang.profiler import do_bench

⚠️ Potential issue | 🔴 Critical

Critical: do_bench call signatures are incompatible with the centralized function.

The import change introduces a breaking issue. At lines 480-481, do_bench is called with tensor arguments:

fla_time = do_bench(chunk_gated_delta_rule_bwd_dhu, Q, K, W, G, h0, dht, dO, dv, scale, chunk_size=chunk_size)
tilelang_time = do_bench(kernel, Q, K, W, G, h0, dht, dO, dv)

The centralized do_bench from tilelang.profiler.bench expects fn, warmup, rep, _n_warmup, _n_repeat, ... — not input tensors. The tensors will be incorrectly interpreted as timing parameters.

The calls should wrap the function and its arguments in a lambda or functools.partial:

🐛 Proposed fix
-    fla_time = do_bench(chunk_gated_delta_rule_bwd_dhu, Q, K, W, G, h0, dht, dO, dv, scale, chunk_size=chunk_size)
-    tilelang_time = do_bench(kernel, Q, K, W, G, h0, dht, dO, dv)
+    fla_time = do_bench(lambda: chunk_gated_delta_rule_bwd_dhu(Q, K, W, G, h0, dht, dO, dv, scale, chunk_size=chunk_size))
+    tilelang_time = do_bench(lambda: kernel(Q, K, W, G, h0, dht, dO, dv))
🤖 Prompt for AI Agents
In `@examples/gdn/example_chunk_delta_bwd.py` at line 7, The do_bench import now
points to the centralized profiler which expects a callable plus timing params,
but the current calls pass tensor args directly (see do_bench calls with
chunk_gated_delta_rule_bwd_dhu and kernel and tensors Q, K, W, G, h0, dht, dO,
dv, scale, chunk_size), causing the tensors to be parsed as timing parameters;
fix by wrapping the target function and its arguments into a zero-arg callable
(e.g., use a lambda or functools.partial) so do_bench receives a single callable
and pass chunk_size as a keyword inside that wrapper or via partial, e.g. wrap
chunk_gated_delta_rule_bwd_dhu with its tensors and chunk_size and similarly
wrap kernel before calling do_bench.

import tilelang
import tilelang.language as T
from tilelang.autotuner import autotune
from tilelang.profiler import do_bench

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, find the profiler module and examine do_bench signature
find . -type f -name "bench.py" | grep -E "profiler|tilelang"

Repository: tile-ai/tilelang

Length of output: 88


🏁 Script executed:

#!/bin/bash
# Explore the repository structure
ls -la

Repository: tile-ai/tilelang

Length of output: 2689


🏁 Script executed:

#!/bin/bash
# Check if tilelang/profiler exists and what's in it
find . -type d -name "profiler" 2>/dev/null | head -5

Repository: tile-ai/tilelang

Length of output: 105


🏁 Script executed:

#!/bin/bash
# Find all Python files in tilelang directory
find . -path "./tilelang/profiler*" -type f -name "*.py" 2>/dev/null

Repository: tile-ai/tilelang

Length of output: 120


🏁 Script executed:

#!/bin/bash
# Examine the example file to see the actual do_bench usage
if [ -f "examples/gdn/example_chunk_delta_h.py" ]; then
  echo "=== Lines 300-320 of example_chunk_delta_h.py ==="
  sed -n '300,320p' examples/gdn/example_chunk_delta_h.py
fi

Repository: tile-ai/tilelang

Length of output: 830


🏁 Script executed:

#!/bin/bash
# Examine the do_bench function signature in bench.py
cat -n ./tilelang/profiler/bench.py | head -50

Repository: tile-ai/tilelang

Length of output: 2029


🏁 Script executed:

#!/bin/bash
# Get the full function signature
ast-grep --pattern $'def do_bench($$$):
    $$$'

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

#!/bin/bash
# Continue reading bench.py to find do_bench function
cat -n ./tilelang/profiler/bench.py | tail -n +50 | head -100

Repository: tile-ai/tilelang

Length of output: 4530


🏁 Script executed:

#!/bin/bash
# Search for the do_bench function definition
rg "def do_bench" -A 10 ./tilelang/profiler/bench.py

Repository: tile-ai/tilelang

Length of output: 399


🏁 Script executed:

#!/bin/bash
# Check git history to see the previous version
git log --oneline -5 HEAD

Repository: tile-ai/tilelang

Length of output: 151


🏁 Script executed:

#!/bin/bash
# Get the previous commit and show the file from before
git show HEAD~1:examples/gdn/example_chunk_delta_h.py 2>/dev/null | head -320 | tail -50

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

#!/bin/bash
# Check the diff of the current file in HEAD
git diff HEAD~1 examples/gdn/example_chunk_delta_h.py 2>/dev/null | head -100

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

#!/bin/bash
# Check git log to understand the history better
git log --oneline examples/gdn/example_chunk_delta_h.py 2>/dev/null | head -10

Repository: tile-ai/tilelang

Length of output: 151


🏁 Script executed:

#!/bin/bash
# Check what files were changed in the HEAD commit
git show --stat HEAD

Repository: tile-ai/tilelang

Length of output: 50372


🏁 Script executed:

#!/bin/bash
# Show the diff of the specific file
git show HEAD:examples/gdn/example_chunk_delta_h.py 2>/dev/null | sed -n '1,20p'

Repository: tile-ai/tilelang

Length of output: 669


🏁 Script executed:

#!/bin/bash
# Check the previous commit for that file
git show HEAD~1:examples/gdn/example_chunk_delta_h.py 2>/dev/null | sed -n '1,320p' | tail -50

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

#!/bin/bash
# Check what the previous version of the file looked like (look further back in history)
git log --all --pretty=oneline examples/gdn/example_chunk_delta_h.py 2>/dev/null | head -20

Repository: tile-ai/tilelang

Length of output: 1134


🏁 Script executed:

#!/bin/bash
# Look at the previous commit to see if there was a local do_bench definition
git show 0f980f15c575bf35db73a70fc04a8a53c005b2c8:examples/gdn/example_chunk_delta_h.py 2>/dev/null | sed -n '1,50p'

Repository: tile-ai/tilelang

Length of output: 1466


🏁 Script executed:

#!/bin/bash
# Check if the previous version had a local do_bench function
git show 0f980f15c575bf35db73a70fc04a8a53c005b2c8:examples/gdn/example_chunk_delta_h.py 2>/dev/null | grep -A 30 "def do_bench"

Repository: tile-ai/tilelang

Length of output: 778


Fix do_bench call signatures incompatible with the centralized function.

The import change is correct, but the calls at lines 302-313 are incompatible with the centralized do_bench signature.

The previous local implementation accepted do_bench(fn, *args, warmup=10, rep=10, **kwargs) and passed *args and **kwargs to the function. The centralized version from tilelang.profiler.bench has signature do_bench(fn, warmup, rep, _n_warmup, _n_repeat, quantiles, fast_flush, backend, return_mode) and calls fn() with no arguments.

Lines 302-312 call do_bench(chunk_gated_delta_rule_fwd_h, k=K, w=W, u=U, ...) with keyword arguments that don't exist in the new signature, causing TypeError: unexpected keyword arguments.

Line 313 calls do_bench(kernel, K, W, U, G, initial_state) with tensor values as positional arguments, which will be misinterpreted as warmup, rep, _n_warmup, _n_repeat, and quantiles parameters with type mismatches.

Both calls need to be refactored to wrap the function invocations appropriately for the new centralized do_bench interface.
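
A hedged sketch of the wrapped call sites; only the k/w/u keywords and the kernel's positional inputs come from the review above, while any remaining keyword arguments of the original call and the warmup/rep values are placeholders:

# Wrap each benchmark target in a zero-argument lambda so the centralized
# do_bench can invoke it with no arguments.
fla_time = do_bench(
    lambda: chunk_gated_delta_rule_fwd_h(k=K, w=W, u=U),  # plus the remaining kwargs from the original call
    warmup=25, rep=100)
tilelang_time = do_bench(
    lambda: kernel(K, W, U, G, initial_state),
    warmup=25, rep=100)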

🤖 Prompt for AI Agents
In `@examples/gdn/example_chunk_delta_h.py` at line 7, The calls to do_bench must
be adapted to the centralized signature that invokes fn() with no args: wrap the
target functions (chunk_gated_delta_rule_fwd_h and kernel) into zero-argument
callables (e.g., lambda or functools.partial) that capture K, W, U, G,
initial_state and any other inputs, and then call do_bench with explicit
benchmarking parameters (warmup, rep, _n_warmup, _n_repeat, quantiles,
fast_flush, backend, return_mode) rather than passing tensors as
positional/keyword args; update the two sites where do_bench is invoked so they
pass a zero-arg wrapper and appropriate numeric/flag values for the benchmark
options.

@LeiWang1999 LeiWang1999 merged commit 651f885 into tile-ai:main Jan 17, 2026
6 checks passed
oraluben pushed a commit to oraluben/tilelang that referenced this pull request Jan 18, 2026
* refactor: replace local do_bench function with centralized profiler import

- Removed local implementations of the `do_bench` function from multiple example files.
- Updated imports to use the centralized `do_bench` function from `tilelang.profiler`, promoting code reuse and consistency across examples.

* refactor: remove unused do_bench import from example_chunk_o_bwd.py and __init__.py

- Eliminated the unused `do_bench` import from `example_chunk_o_bwd.py` and `tilelang/profiler/__init__.py`, streamlining the code and improving clarity.

* lint fix
