378 commits
f4f87f4
[Bugfix] Improve autotune from elementwise_add function in examples (…
senlyu163 Dec 17, 2025
0814b17
[Language] Introduce `T.annotate_restrict_buffers` (#1428)
LeiWang1999 Dec 17, 2025
f914f2d
[Analyzer] Require loop extent > 0 when entering loop (#1451)
kurisu6912 Dec 17, 2025
0c25c4f
Update ROCm CI to Nightly-ROCm-7.1 (#1449)
Gongen-Ali Dec 17, 2025
c750fb8
[Enhancement] Update examples and tests for improved type handling fu…
LeiWang1999 Dec 17, 2025
aa19342
[Issue Template] Enable blank issues in GitHub issue template(#1453)
LeiWang1999 Dec 17, 2025
6aaf3c7
[CI] Moved the clang-tidy step to after pip install (#1456)
LeiWang1999 Dec 17, 2025
3ee0939
[Bug] Fix tvm build script when patchelf is not found (#1459)
kurisu6912 Dec 17, 2025
91cf796
[Analyzer] Fix floordiv & floormod bug in z3 prover (#1458)
kurisu6912 Dec 17, 2025
48e70e6
[Cache] Rename sparse compress cache directory (#1460)
LeiWang1999 Dec 17, 2025
cae06ed
[Language]Adds a random number generation capability through curand_k…
silentCoder-dev Dec 18, 2025
a6f59f3
remove unused duplicated type check (#1462)
sgjzfzzf Dec 18, 2025
7248a81
feat(cutedsl): add CuTeDSL backend (#1421)
lucifer1004 Dec 18, 2025
f067260
[Refactor] Rename test for curand & add triton baseline in `test_tile…
silentCoder-dev Dec 19, 2025
f6db201
[ArgBinder] Enhance shape variable handling and assertions (#1467)
LeiWang1999 Dec 19, 2025
1a3a64f
[Language] Make TL scripts friendly to Python syntax highlights (#1466)
SiriusNEO Dec 19, 2025
95e3b5a
[Refactor] Remove triton dependence in testing & move triton baseline…
silentCoder-dev Dec 19, 2025
3516f1e
[Language] Enhance T.dtype.as_torch conversion for compatibility (#1473)
LeiWang1999 Dec 19, 2025
2217eb7
[News] update with latest news (#1475)
LeiWang1999 Dec 19, 2025
168aec7
[Enhancement] Use static Z3 context (#1482)
LeiWang1999 Dec 19, 2025
7e8d1f8
[Enhancement] Enhance let binding handling in layout inference and wa…
LeiWang1999 Dec 20, 2025
a874e4e
[Refactor] Phaseout PassConfig `kDisableDynamicTailSplit` and `kDynam…
LeiWang1999 Dec 21, 2025
a431797
[Enhancement] Optimize the time cost of critical path for IntervalSet…
LeiWang1999 Dec 22, 2025
ba23181
[CI] Add performance regression test script (#1489)
xwhzz Dec 22, 2025
718e398
Pin nvidia-cutlass-dsl to 4.3.3 (#1497)
lucifer1004 Dec 22, 2025
5acaab7
[Language] Remove ConstIf Frame for better meta programming (#1496)
kurisu6912 Dec 22, 2025
6e0982d
[CI] Fix concurrency bug in regression test workflow (#1500)
xwhzz Dec 22, 2025
1d9a2ea
[Refactor] Phaseout legacy `alloc_local` statement in examples and in…
LeiWang1999 Dec 22, 2025
2d8bf3e
[Enhancement] Optimize MHA varlen fwd and support autotune (#1499)
Rachmanino Dec 22, 2025
174fbe1
[Enhancement] Refactor CUDA vectorized cast generation and remove uns…
LJC00118 Dec 22, 2025
3593a73
[Dependency] Update apache-tvm-ffi to >=0.1.6 for memory safety when …
LeiWang1999 Dec 23, 2025
74aef5b
Update cutedsl docs and version check(#1503)
lucifer1004 Dec 23, 2025
4d8e609
[Misc] configure pymarkdown (#1505)
lucifer1004 Dec 23, 2025
e79bbcc
[Language] Fix gemm syntax highlight (#1476)
SiriusNEO Dec 23, 2025
783694f
[Fix] Fix TL_ENABLE_PTXAS_VERBOSE_OUTPUT has no effect in tvm-ffi (#1…
kurisu6912 Dec 23, 2025
11f122e
[Refactor] Phaseout execution_backend `ctypes` (#1510)
LeiWang1999 Dec 23, 2025
c7e8cab
[Testing] Add Memory Leak Test (#1516)
kurisu6912 Dec 24, 2025
09385e7
[Refactor] Support auto swizzling for tma store and phaseout related …
LeiWang1999 Dec 24, 2025
41603f8
[CuTeDSL][Fix] thread safety + context safety (#1513)
lucifer1004 Dec 24, 2025
feb106b
[BugFix] Phaseout unused tests for gqa decode kernels and add the ker…
tzj-fxz Dec 24, 2025
42697c0
[Cleanup] Remove unnecessary macros in tilelang examples (#1514)
Rachmanino Dec 24, 2025
98bc297
Fix ramp_lanes calculation in CUDA codegen (#1518)
LJC00118 Dec 24, 2025
0006621
[Misc] add env for default target/backend/verbose (#1512)
lucifer1004 Dec 24, 2025
bea40bd
[Dtype] Improve host codegen handling for subtype (#1517)
LeiWang1999 Dec 24, 2025
cfccd63
[Bugfix] Fallback to a Linear Layout instead of raising errors (#1521)
LeiWang1999 Dec 24, 2025
2ca5e39
Use `TargetIsCuda` for all cuda target (#1522)
oraluben Dec 24, 2025
d0bcc69
Fix fp4 pointer arithmetic in CUDA codegen (#1524)
LJC00118 Dec 24, 2025
d7e264f
[Enhancement] Improve GitHub Actions permissions check and refine per…
xwhzz Dec 24, 2025
3c11823
[Release] Bump version into 0.1.7.post1 (#1506)
LeiWang1999 Dec 24, 2025
d140415
[Pipeline] Refactor buffer allocation in Inject Pipeline Pass (#1525)
LeiWang1999 Dec 24, 2025
0c3d913
[Dev] Fix when build local version with isolated build (#1487)
oraluben Dec 25, 2025
2b79a76
[Bugfix] Skip stride check for subtype (#1531)
LeiWang1999 Dec 25, 2025
3ce8ac9
[Lint] Enable whitespace and permission bit hooks (#1439)
XuehaiPan Dec 25, 2025
14067c3
[Enhancement][Tool] Tree-style pretty ASTPrinter (#1468)
SiriusNEO Dec 25, 2025
d5d959e
[Fix] Add support for non-var complement arithmetic computation (#137…
kurisu6912 Dec 25, 2025
dff10e5
[BugFix] Complete vectorized loading for common dtypes (#1536)
SiriusNEO Dec 25, 2025
d219f6c
[Compat] Add CUDA version check for __nv_fp8_e8m0 type (#1537)
LeiWang1999 Dec 25, 2025
2e82f37
[Bug] Fix bugs of varlen attention forward examples caused by `S_q !=…
hukongyi Dec 26, 2025
a9d65d9
[Bug] Fix hanging from reduction on sm120 (#1540)
PannenetsF Dec 26, 2025
5bba4df
[example] use T.dynamic instead of tvm.te.var (#1538)
botbw Dec 26, 2025
9ff7c52
[Enhancement] Refactor KernelCache to use inheritance-based design (#…
sgjzfzzf Dec 26, 2025
9b58ed0
[Bugfix] Avoid considering `local.var` buffer as `local` (#1541)
LeiWang1999 Dec 26, 2025
875b42f
[Bugfix] Fix of `T.Fill` for local.var (#1543)
LeiWang1999 Dec 26, 2025
c9371a5
[Z3] Change z3 timeout to rlimit for deterministic prove behavior (#1542)
kurisu6912 Dec 27, 2025
72ce848
[Feat] Adapt gemm v2 for cutedsl backend (#1544)
lucifer1004 Dec 27, 2025
d70cf36
[Enhancement] Support larger `H` in deepseek sparse mla backward via …
Rachmanino Dec 27, 2025
23ede42
[Bugfix] Fix regression test to use installed package instead of sour…
xwhzz Dec 28, 2025
b6ace13
[Refactor] Introduce layout annotations for `ParallelOPNode` and `Cop…
LeiWang1999 Dec 28, 2025
f57956d
[Script] Provide regression test script to help benchmark regression …
LeiWang1999 Dec 28, 2025
470d8b2
[Typing] Update Kernel signature and add type hints for buffer operat…
clouds56 Dec 29, 2025
193eff1
[CI]: Bump actions/upload-artifact from 4 to 6 (#1555)
dependabot[bot] Dec 29, 2025
d317710
Use cuda capability from torch to be more generic (#1557)
oraluben Dec 29, 2025
9f998e3
[CI]: Bump actions/github-script from 7 to 8 (#1556)
dependabot[bot] Dec 29, 2025
27db71f
[Host] Provide post process to customize host code and enhance nullab…
LeiWang1999 Dec 29, 2025
e64961f
[Release] Build tilelang against CUDA 13.1 in CI (#1532)
oraluben Dec 29, 2025
b702299
[LazyJIT] Move Type Annotations to Function Body (#1480)
kurisu6912 Dec 29, 2025
124583b
[bugfix] fix missing logic for clear_accum (#1563)
botbw Dec 29, 2025
f4ad7d3
[Misc] Remove unused `tl_pipeline_sync`. (#1566)
c8ef Dec 29, 2025
cca8b6f
[Refactor] Improve scalarization handling in vectorization logic (#1565)
LeiWang1999 Dec 29, 2025
e23bce7
[Refactor] Simplify do_bench calls by using default warmup and rep pa…
LeiWang1999 Dec 29, 2025
8c9101e
[CI] Refactor PR regression test job conditions (#1569)
xwhzz Dec 29, 2025
0f9bbd7
[Parallel][Infer] Free-mode chooses minimal replication between buffe…
LeiWang1999 Dec 30, 2025
b6a2513
[Refactor] Enhance deterministic ordering in shared memory allocation…
LeiWang1999 Dec 30, 2025
0fa16b4
[Enhancement] Improve equality checks in layout nodes and fragment va…
LeiWang1999 Dec 30, 2025
e1138ad
[Feature] add kUseCooperativeLaunch tag for tvm_ffi (#1572)
silentCoder-dev Dec 31, 2025
7cf1f26
[Refactor] Remove unnecessary logging configuration in Analyzer.py (#…
LeiWang1999 Dec 31, 2025
53ea96c
[Release] Bump version to 0.1.7.post2 (#1575)
LeiWang1999 Dec 31, 2025
15c457f
[BugFix] Change default rounding mode for fp4 conversions (#1580)
LJC00118 Dec 31, 2025
3b7ebc0
[CI] Add CUDA-aware pytest scheduler + auto workers (#1584)
LeiWang1999 Dec 31, 2025
dcacc5a
[Enhancement] Improve performance regression output with timing and s…
xwhzz Dec 31, 2025
0643349
[Bugfix] Add kernel_global_source property to TVMFFIKernelAdapter (#1…
haok1402 Jan 1, 2026
e1f76d1
Add PrimExpr substitution support for AttrStmt nodes (#1583)
LJC00118 Jan 1, 2026
d6eb5d3
[BugFix] fix tcgen5mma example (#1577)
Rachmanino Jan 1, 2026
6fd0bd3
[Refactor] Use access_ptr instead of buffer and offsets for cp async …
LeiWang1999 Jan 3, 2026
5685c56
[Layout] Support annotating loop layout in frontend (#1579)
LeiWang1999 Jan 4, 2026
946f611
[Typo] Rename loop layout annotation test(#1596)
LeiWang1999 Jan 4, 2026
23c446d
[Fix] Add register to read A ptr in `test_tilelang_language_cooperati…
silentCoder-dev Jan 4, 2026
9fd6ea7
[Feat] PDL Support (#1494)
w169q169 Jan 4, 2026
e3c9a58
[Enhancement][Subtype] Enhance symbolic shape/stride handling for sub…
LeiWang1999 Jan 4, 2026
ebf3b68
[Fix][CuteDSL] add support for tanh/tanhf (fixes #1595) (#1597)
lucifer1004 Jan 4, 2026
0092b79
[Release] Fix race condition when publishing (#1578)
oraluben Jan 4, 2026
32aec8a
Add conversion from cutlass::float_e4m3/e5m2 to tl::float_e4m3/e5m2 (…
LJC00118 Jan 4, 2026
7198aa5
[Enhancement][AMD] Add preshuffle fp8 gemm example on amd. (#1605)
Gongen-Ali Jan 5, 2026
9095c5a
[Bugfix] Mangle Single Precision Mathematical Functions of cuda math …
silentCoder-dev Jan 5, 2026
4e53b50
[Bugfix] Open Rocm ci test and fix some bugs. (#1443)
Gongen-Ali Jan 5, 2026
7cfd087
[Feature] Add more curand operations & support vectorization (#1582)
silentCoder-dev Jan 5, 2026
cfbc49b
[Enhancement] Allow `import tilelang` on CPU-only machines without CU…
XuehaiPan Jan 6, 2026
9d446f3
[BugFix] Add pre-commit to requirements-dev.txt (#1611)
asaadkhaja99 Jan 6, 2026
1b00220
[BugFix] Fix some bugs in lowering ParallelOp and VectorizeLoop (#1607)
SiriusNEO Jan 6, 2026
5d80629
[Feat] Add strong checker to detect data racing in T.Parallel (#1615)
kurisu6912 Jan 6, 2026
a756074
[Feature] add `T.sync_warp` & `T.shfl_sync`; change extern pdl into i…
silentCoder-dev Jan 6, 2026
1424b6c
[RaceChecker] RaceChecker report warning rather than error for backwa…
kurisu6912 Jan 7, 2026
c8f4e23
[Fix] Update type hint handling for Python 3.10 compatibility in get_…
kurisu6912 Jan 7, 2026
91c7f71
[Refactor] Move `ConstrVisitor` to `src/transform/common/constr_visit…
silentCoder-dev Jan 7, 2026
a8bf4f6
[Feat] Improve `T.reduce_absmax` to use less abs call (#1626)
kurisu6912 Jan 7, 2026
358f899
[Bugfix] Do not consider local.var as local buffer during LowerTileOP…
LeiWang1999 Jan 7, 2026
b5be1a1
[Feature] Add hoist_broadcast_values pass (#1606)
silentCoder-dev Jan 7, 2026
1bcce8b
[Enhancement][CUDA] Support `nvidia-cuda-nvcc` as `nvcc` (#1528)
clouds56 Jan 7, 2026
566d8f2
[Bugfix] Fallback into full region when dynamic buffer read region ca…
LeiWang1999 Jan 7, 2026
aca9218
[Feat] Allow print macro call stack in device assert (#1616)
kurisu6912 Jan 7, 2026
b914318
[BugFix] Correct index_map selection for transposed A matrix in MFMA …
benenzhu Jan 8, 2026
d5503cd
[Example] Add Seesaw Sparse MLA Forward Kernel for DeepSeek-V3 (#1636)
hammersam Jan 8, 2026
a56212d
[Bugfix] Introduce a flag to avoid unnecessary broadcast hoist and en…
LeiWang1999 Jan 8, 2026
ce68b51
[Refactor] Update main function signatures in example scripts to acce…
LeiWang1999 Jan 8, 2026
6e43953
[Refactor] Unify @jit and @lazy_jit into a single @jit decorator (#1632)
LeiWang1999 Jan 9, 2026
409677d
[Bugfix] Fix pdl related intrin handling to avoid strict annotation c…
LeiWang1999 Jan 9, 2026
d06fc9c
[Bugfix] reverted unexpected tvm changes (#1651)
LeiWang1999 Jan 9, 2026
9264dfa
[Bugfix] reverted unexpected tvm changes (#1652)
LeiWang1999 Jan 9, 2026
01d651a
[Refactor] Move dtypes.py from eager to language and add bits/bytes p…
LeiWang1999 Jan 9, 2026
326be4d
[Feat] Allow dangling producer in wasp pipeline planning (#1263) (#1647)
kurisu6912 Jan 9, 2026
b246c39
[bugfix] fix smem alloc for single warp reduce (#1643)
botbw Jan 9, 2026
0fa4150
[Example] Add attention sink varlen examples (#1645)
Rachmanino Jan 9, 2026
8e9d155
[ASTPrinter] Fix IfThenElse printing and some format problems (#1640)
SiriusNEO Jan 10, 2026
a929c15
[CI] [pre-commit.ci] autoupdate (#1610)
pre-commit-ci[bot] Jan 10, 2026
efdeadc
[Enhancement] Update LetStmtNode handling in loop vectorization to su…
Rachmanino Jan 11, 2026
5e347e3
[Example] Remove redundant T.copy in `examples/deepseek_v32/sparse_ml…
GoldenStain Jan 12, 2026
fd260e3
[CUDA] Introduce simulated load/store 256bits access for CUDA compati…
LeiWang1999 Jan 12, 2026
9936636
[Enhancement] Improve unroll loop functionality for dynamic extent an…
LeiWang1999 Jan 12, 2026
0d85a28
[Bugfix] Fix missing annotations for default CallNode Visitor (#1659)
LeiWang1999 Jan 12, 2026
5e90edc
[Clean] Remove unnecessary debug print (#1661)
LeiWang1999 Jan 13, 2026
4fb96a0
[Bugfix] Fix variable scoping issue in InjectSoftwarePipeline for tra…
LeiWang1999 Jan 13, 2026
29ece98
[Refactor] Improve CallNode handling to include annotations in variou…
LeiWang1999 Jan 13, 2026
c3c8831
[EagerJIT] Add Support for Parameter Only Kernel Compilation (#1664)
kurisu6912 Jan 13, 2026
802951e
[AutoDD] Add Tilelang AutoDD to Reduce Buggy Program (#1639)
KEKE046 Jan 13, 2026
732971a
[Feature] Support `cp.reduce.async.bulk.tensor` (#1667)
Rachmanino Jan 14, 2026
4084dcd
chore: update CI cutedsl version to 4.3.5 (#1665)
lucifer1004 Jan 14, 2026
2d8d367
[CUDA] Enhance Broadcast Codegen for Symbolic Value (#1669)
LeiWang1999 Jan 14, 2026
47aaf7b
[EagerJIT] Fix bug in handling of positional arguments (#1675)
kurisu6912 Jan 15, 2026
1d71bed
[Feature] Reimplement `Threadsync` with `ConstrVisitor` (#1631)
silentCoder-dev Jan 15, 2026
f035315
[Clean][Refactor] Phaseout Legacy Pass `ParallelLoopTransformer` (#1672)
LeiWang1999 Jan 15, 2026
b27fb92
[Feature] Atomic Reduction Operations and Vectorization Enhancement (…
LeiWang1999 Jan 15, 2026
5feb225
[Refactor] Move AtomicAdd Vectorization to VectorizeLoop Pass (#1677)
LeiWang1999 Jan 16, 2026
65d5aec
[Bugfix] Relax region analysis for complex expression (#1679)
LeiWang1999 Jan 16, 2026
60050f2
[Example] Add example for mHC inference kernels. (#1684)
Elevator14B Jan 16, 2026
6cd511e
[Analyzer] Fix missing assume in tvm analyzer (#1680)
kurisu6912 Jan 16, 2026
651f885
Refactor: Use centralized do_bench from tilelang.profiler (#1670)
LeiWang1999 Jan 17, 2026
2ff1dd5
[Feature] Introduce DecoupleTypeCast pass for mixed-precision vectori…
LeiWang1999 Jan 17, 2026
62b8505
[Release] Bump Version into v0.1.7.post3 (#1685)
LeiWang1999 Jan 17, 2026
04f98c3
[Release] Fix release wheels (#1687)
oraluben Jan 18, 2026
bb7f30c
[BUG] Fix dsa_sparse_finetune/sparse_mla_bwd.py bug (#1588)
xiuhu17 Jan 18, 2026
e1dbc95
[Bugfix] Reorganize pass for `thread_sync` (#1682)
silentCoder-dev Jan 18, 2026
f808358
[BugFix] fix warning on deepseek_v32 topk_selector.py (#1681)
sgjzfzzf Jan 18, 2026
10baa27
[tvm-ffi] Enable tvm-ffi for metal backend (#1289)
oraluben Jan 19, 2026
8051cd4
[Analyzer] Fix missing assume in tvm analyzer (#1695)
LJC00118 Jan 19, 2026
c871813
[Chore] Use python-side control flow keywords in examples for consist…
Rachmanino Jan 19, 2026
c9cec62
[Bugfix][Refactor] Always disable light storage reuse (#1691)
LeiWang1999 Jan 19, 2026
a19206a
[Enhancement] Log warnings for OOB accesses to non-global buffers (#1693)
SiriusNEO Jan 19, 2026
9a9255f
Enhance loop vectorization logic for CallNode handling (#1696)
LeiWang1999 Jan 19, 2026
6399be0
[BugFix] Fix JITKernel export_library bug (#1699)
chengyupku Jan 20, 2026
93f0047
[Enhancement] Handle vectorizable calls (#1700)
LeiWang1999 Jan 20, 2026
bf675e2
[BugFix] Fix unsafe visit else case under WarpSpecializationScope (#1…
SiriusNEO Jan 20, 2026
608ab49
[Enhancement] Use `cute::elect_one_sync()` for slightly better perfor…
Rachmanino Jan 21, 2026
470a192
[Enhancement] Remove `RewriteUnsafeSelect` Pass (#1705)
LJC00118 Jan 21, 2026
88ae8a8
[BugFix] Corrected when proving loop layout contains a fragment buffe…
LeiWang1999 Jan 21, 2026
8c15eca
[Bugfix] Improve robustness of ProveFragmentContains with fully repli…
LeiWang1999 Jan 21, 2026
f1c19fd
[BugFix] Add int64_t support for AtomicAdd (#1716)
LeiWang1999 Jan 22, 2026
82729e5
[Refactor] Introduce GemmInst enumeration and update warp partitionin…
Rachmanino Jan 22, 2026
df27da4
[Refactor] Phaseout unnecessary checks for pr #1707 (#1721)
LeiWang1999 Jan 23, 2026
4ab369d
[Refactor] re-implement vector subtype and its access method (#1722)
LeiWang1999 Jan 23, 2026
5358f5a
[EagerJIT] Lazy Evaluation of Kernel Body in Eager JIT (#1690) (#1694)
kurisu6912 Jan 23, 2026
3209b51
[Enhancement] Legalize subtype access (#1724)
LeiWang1999 Jan 23, 2026
8f27a31
[EagerJIT] Enhance auto inference of lazyjit and eager jit (#1704)
kurisu6912 Jan 23, 2026
56dbc9f
[Refactor] Enhance variable substitution in device function generatio…
LeiWang1999 Jan 23, 2026
5fe8b84
[Bugfix] Fix incorrect alignment of vectorized subtype (#1726)
LeiWang1999 Jan 23, 2026
8bd0293
[Enhancement] Add explicit global memory load/store intrinsics (ldg/s…
LeiWang1999 Jan 26, 2026
4159173
[Refactor] Remove external buffer conflict check in pipeline injectio…
LeiWang1999 Jan 26, 2026
f5790f5
[Refactor] Relocate layout transformation of `ptx_stmatrix` (#1689)
LeiWang1999 Jan 26, 2026
fe26fd8
[AMD] Add MI350/MI355 FP8 support (#1718)
hubertlu-tw Jan 26, 2026
ebf4a7c
[Bugfix] revert incorrect fast path for parallel layout inference (#1…
LeiWang1999 Jan 26, 2026
5748841
[Example] Add KDA algorithm implementation in tilelang (#1660)
wfloveiu Jan 26, 2026
a17230e
[Feature] Support E8M0 related type conversion and vectorized cast (#…
SiriusNEO Jan 27, 2026
7ec5542
[BugFix] Remove unnecessary binding in loop variable analysis and add…
kurisu6912 Jan 27, 2026
ea70dad
Add swizzle layout detection and automatic merging for layout conflic…
LeiWang1999 Jan 27, 2026
da50a34
[Bugfix] Handle offset handling for subtype ptr (#1738)
LeiWang1999 Jan 27, 2026
cccf724
[EagerJIT] Allow dummy parameter in jit kernel (#1737)
kurisu6912 Jan 27, 2026
2f59613
[Feature] Add build date to version metadata (#1742)
LeiWang1999 Jan 27, 2026
f5525ea
[BugFix] Fix FP4 related vectorized cast (#1741)
chaospointer Jan 27, 2026
350f987
[Refactor] Disable Predicated LDG PTX Lowering by default (#1739)
LeiWang1999 Jan 27, 2026
fa9660b
[Layout] Fix Layout Bugs in Parallel and Reduce (#1713)
kurisu6912 Jan 28, 2026
413ecbb
[fix]: fix deepseek_mla amd example and add aiter mla compare test (#…
ZiguanWang Jan 28, 2026
fe5bf35
[Refactor] Enhance `T.alloc_barrier` with new features and deprecate …
Rachmanino Jan 28, 2026
28ada60
[BugFix] Fix several bugs in CodeGen for CuTeDSL backend (#1746)
Rachmanino Jan 28, 2026
feb09a1
Update import for compare_tensors from test_utils_kda (#1748)
pmixer Jan 28, 2026
6b3c425
[Lint] Remove diff arguments in Ruff and sync some versions (#1751)
SiriusNEO Jan 28, 2026
1f0f5a4
[Refactor] Rename EagerJIT examples to avoid confusion (#1750)
SiriusNEO Jan 29, 2026
a55a823
[AMD] Fix ROCm FP8 dtype selection and MFMA support on gfx942/gfx950 …
hubertlu-tw Jan 29, 2026
3fbf562
[Feature] Support message-only debug print (#1755)
Rachmanino Jan 29, 2026
8e1358d
[EagerJIT] Update README example to eager jit (#1752)
kurisu6912 Jan 29, 2026
9ddf577
[BugFix] Stride check and fix for tensors with zero-stride argument (…
tzj-fxz Jan 29, 2026
4fa2d24
[BugFix] Always build guard in loop partitioning to prevent out-of-bo…
LeiWang1999 Jan 30, 2026
14cb3eb
[Tool] Add tool to print fragment in thread value view (#1759)
kurisu6912 Jan 30, 2026
07d75bf
[Enhancement] Add dynamic symbolic constraints support for Profiler b…
LeiWang1999 Jan 30, 2026
fbf28a1
[ThreadSync] Use Z3 for constraint equivalence checking (#1760)
LeiWang1999 Jan 30, 2026
8c45417
[Feature] Support Pass LoopUnswitching (#1747)
chengyupku Jan 31, 2026
007929f
[Chore] Remove unnecessary log from z3 (#1763)
Rachmanino Feb 1, 2026
f9e3767
[Bugfix] Revert the initial value of Z3 SetRLimit (#1765)
LeiWang1999 Feb 2, 2026
5f60c09
[Feature] Enhance Loop Unswitching with Let Binding and Condition Han…
LeiWang1999 Feb 2, 2026
c4748da
[Bugfix] Add predicate to loads inside predicated stores in LowerLDGS…
LeiWang1999 Feb 2, 2026
7bf4344
[Feature] Add PassConfig for Controlling Let Statement Inlining in Si…
LeiWang1999 Feb 2, 2026
7abf040
[Fix] Change ue8m0 default round mode to cudaRoundPosInf (#1770)
SiriusNEO Feb 2, 2026
727d108
[Feature] Support tcgen5mma lowering for `.kind::i8` (#1764)
Rachmanino Feb 2, 2026
378a8f2
[Refactor] Unify the usage of cast-related operators (#1757)
SiriusNEO Feb 2, 2026
5ba7818
[Bugfix] Copy pass_configs dict to prevent mutation across multiple J…
LeiWang1999 Feb 3, 2026
6753286
[CI] [pre-commit.ci] autoupdate (#1775)
pre-commit-ci[bot] Feb 3, 2026
3b3369e
[Refactor] Improve type annotations and reduce some lint errors in fr…
SiriusNEO Feb 3, 2026
79805d3
Update TVM: fix select/if_then_else out-of-bounds access (#1783)
LeiWang1999 Feb 3, 2026
d8fc084
[Feature] Add fully replicated layout interface in annotation layout …
tzj-fxz Feb 4, 2026
db03bba
[Example][BugFix] Fix arguments override in deepseek_v32 topk_select…
ljwljwljwljw Feb 4, 2026
215f436
[BugFix] Fix reduce_sum with clear=False not accumulating correctly (…
ShaobinChen-AH Feb 4, 2026
bb0634d
fix(intrinsics): add missing _legalize_to_buffer_region in SM70 emitt…
Coloured-glaze Feb 4, 2026
df87c56
[Enhancement] Enhance register vectorize inference (#1785)
LeiWang1999 Feb 4, 2026
191d879
[Bugfix] Fix thread storage sync conflict detection for loop carry wr…
LeiWang1999 Feb 4, 2026
841c446
[Fix] cython 3.0 generates incorrect code for python stable api (#1789)
oraluben Feb 5, 2026
c1481eb
[BugFix] Update buffer access in TensorCoreIntrinEmitter to handle va…
xwhzz Feb 5, 2026
701756b
[ThreadSync] Skip (tx1 != tx2) checking for loop carry analysis (#1795)
LeiWang1999 Feb 5, 2026
5951bce
[Feature] Add option to disable out-of-bound access warnings in safe …
kurisu6912 Feb 5, 2026
eb7f695
[Docs] Add Python Compatibility document of TileLang (#1745)
LeiWang1999 Feb 5, 2026
b01cb93
[Refactor] Reorganize ParallelOp code structure and move ProveFragmen…
LeiWang1999 Feb 6, 2026
4349b2c
[Feature] Support passing PrimExpr value in tile-level atomic operati…
SiriusNEO Feb 6, 2026
af30ac2
[Bugfix] Support loop-dependent conditions in IfThenElse within T.Pip…
ljwljwljwljw Feb 6, 2026
c47d6df
[BugFix] Missing Recursive Loop Var Checking in Loop Unswitching (#1801)
kurisu6912 Feb 6, 2026
7950dc5
Fix a 3.9 issue. add `_typing.py` to dist check (#1803)
oraluben Feb 7, 2026
46a4e76
[Docs][Puzzles] Add TileLang puzzles in README (#1806)
SiriusNEO Feb 7, 2026
4786915
[Docs] Hotfix wrong link (#1807)
SiriusNEO Feb 7, 2026
d85e0c6
[Enhancement] Improve plot_layout visualization for Layouts (#1811)
LeiWang1999 Feb 8, 2026
fbd5334
[Feat] profiler support cudagraph backend (#1658)
cscyuge Feb 8, 2026
2172406
Handle staled autotune state with tvm-ffi adapter. (#1812)
haok1402 Feb 9, 2026
e9d0569
[BugFix] LoopUnswitching: gate non-trivial else behind PassConfig (#1…
LeiWang1999 Feb 9, 2026
790388c
[Release] Update dependencies to resolve several issues (#1817)
oraluben Feb 9, 2026
0c1df35
[Enhancement] Integrate arith::Analyzer into Loop Vectorizer for impr…
kurisu6912 Feb 11, 2026
2 changes: 2 additions & 0 deletions .clang-tidy
@@ -4,7 +4,9 @@ ExtraArgs: []
FormatStyle: file
UseColor: true
WarningsAsErrors: '*'
# FIXME: Use `ExcludeHeaderFilterRegex` instead when all maintainers upgraded their `clang-tidy`
HeaderFilterRegex: '^(?!.*(?:/|^)(3rdparty|tvm)/).*'
# ExcludeHeaderFilterRegex: '^(3rdparty|tvm)/.*$'

# NOTE: there must be no spaces before the '-', so put the comma last.
Checks: >-
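The `HeaderFilterRegex` in the hunk above uses a negative lookahead to skip any header whose path contains a `3rdparty/` or `tvm/` directory component. The sketch below checks that behaviour with Python's `re` module as a stand-in. clang-tidy's own regex engine may treat the pattern differently, and both the helper name `is_checked` and the sample paths are hypothetical.

```python
import re

# Pattern copied verbatim from the HeaderFilterRegex line in .clang-tidy.
HEADER_FILTER = re.compile(r"^(?!.*(?:/|^)(3rdparty|tvm)/).*")


def is_checked(path: str) -> bool:
    """Return True when the header path passes the filter (hypothetical helper)."""
    return HEADER_FILTER.match(path) is not None


# Hypothetical paths, used only to illustrate the lookahead.
samples = [
    "src/op/gemm.h",                        # project header
    "3rdparty/cutlass/include/cutlass.h",   # vendored, relative path
    "/repo/3rdparty/tvm/include/tvm/ir.h",  # vendored, absolute path
    "src/mytvm/util.h",                     # "tvm" is only part of a directory name
]
for sample in samples:
    print(f"{sample}: {is_checked(sample)}")
```

The `(?:/|^)` group is what keeps a directory such as `mytvm/` from being excluded: the excluded name has to start either at the beginning of the path or right after a `/`.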
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/config.yml
@@ -1 +1 @@
blank_issues_enabled: false
blank_issues_enabled: true
92 changes: 46 additions & 46 deletions .github/workflows/ci.yml
@@ -40,23 +40,13 @@ jobs:
timeout-minutes: 30
steps:
- name: Checkout repository
uses: actions/checkout@v5
uses: actions/checkout@v6
with:
fetch-depth: 0
submodules: recursive

- name: Setup Python 3.8
id: setup-pylowest
uses: actions/setup-python@v6
with:
python-version: "3.8" # use lowest supported version for linting
update-environment: false

- name: Check AST with Python 3.8
run: |
"${{ steps.setup-pylowest.outputs.python-path }}" -m compileall -q -f tilelang

- name: Setup Python 3.9
id: setup-pylowest
uses: actions/setup-python@v6
with:
python-version: "3.9"
@@ -67,6 +57,10 @@ jobs:
requirements*.txt
.pre-commit-config.yaml

- name: Check AST with Python 3.9
run: |
"${{ steps.setup-pylowest.outputs.python-path }}" -m compileall -q -f tilelang

- name: Pre-commit Lint
run: |
if ! pipx run pre-commit run --all-files --color=always --show-diff-on-failure; then
@@ -93,7 +87,7 @@ jobs:
name: self-hosted-amd
# Format: [Nightly-]ROCm-<major>.<minor>[.<patch>]. E.g., "ROCm-6.4" or "Nightly-ROCm-7.0".
# Use "Nightly-" prefix to use torch nightly builds.
toolkit: ROCm-6.3
toolkit: Nightly-ROCm-7.1
- tags: [macos-latest]
name: macos-latest
toolkit: Metal # or Nightly-Metal
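The comment in the matrix above describes the toolkit string as `[Nightly-]ROCm-<major>.<minor>[.<patch>]`, and the Metal entry shows that the version part can be absent. A minimal sketch of how such a string could be split is given below. The `Toolkit` tuple and `parse_toolkit` function are hypothetical names for illustration, not part of the workflow, and the assumption that every toolkit follows the same shape comes only from the examples visible in this diff.

```python
import re
from typing import NamedTuple, Optional


class Toolkit(NamedTuple):
    nightly: bool            # True when the "Nightly-" prefix is present
    name: str                # e.g. "ROCm", "CUDA", "Metal"
    version: Optional[str]   # e.g. "7.1"; None when no version is given


# Assumed grammar: [Nightly-]<name>[-<major>.<minor>[.<patch>]]
_TOOLKIT_RE = re.compile(
    r"^(?P<nightly>Nightly-)?(?P<name>[A-Za-z]+)"
    r"(?:-(?P<version>\d+\.\d+(?:\.\d+)?))?$"
)


def parse_toolkit(spec: str) -> Toolkit:
    """Split a toolkit string into its parts (hypothetical helper)."""
    match = _TOOLKIT_RE.match(spec)
    if match is None:
        raise ValueError(f"unrecognized toolkit spec: {spec!r}")
    return Toolkit(
        nightly=match.group("nightly") is not None,
        name=match.group("name"),
        version=match.group("version"),
    )


for spec in ("ROCm-6.3", "Nightly-ROCm-7.1", "Metal"):
    print(spec, "->", parse_toolkit(spec))
```

Under this reading, the change from `ROCm-6.3` to `Nightly-ROCm-7.1` flips the nightly flag and bumps the version in one edit.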
@@ -104,7 +98,7 @@ jobs:

steps:
- name: Checkout repository
uses: actions/checkout@v5
uses: actions/checkout@v6
with:
fetch-depth: 0
submodules: recursive
@@ -288,62 +282,66 @@ jobs:
echo "Clearing uv cache at ${UV_CACHE_DIR} due to failure."
uv cache clean

- name: Enable core dump generation (Linux / GitHub-hosted runners)
if: ${{ runner.os == 'Linux' && !startsWith(matrix.runner.name, 'self-hosted') }}
run: |
sudo sysctl -w kernel.core_pattern="core.${{ matrix.python-version }}.${{ matrix.runner.toolkit }}.%P"
sudo sysctl -w kernel.core_uses_pid=0
sudo sysctl -w fs.suid_dumpable=1
sysctl kernel.core_pattern kernel.core_uses_pid fs.suid_dumpable

- name: Enable core dump generation (macOS / GitHub-hosted runners)
if: ${{ runner.os == 'macOS' && !startsWith(matrix.runner.name, 'self-hosted') }}
run: |
sudo sysctl -w kern.corefile="core.${{ matrix.python-version }}.${{ matrix.runner.toolkit }}.%P"
sudo sysctl -w kern.coredump=1
sudo sysctl -w kern.sugid_coredump=1
sysctl kern.corefile kern.coredump kern.sugid_coredump

- name: Install project (wheel form)
run: |
uv pip install -v .

- name: Run clang-tidy
id: clang-tidy
if: runner.os == 'Linux'
run: |
echo "\$ $(command -v clang-tidy) --version" && clang-tidy --version

if [[ -x "$(command -v run-clang-tidy)" ]]; then
echo "Using run-clang-tidy from $(command -v run-clang-tidy)"
CLANG_TIDY=(run-clang-tidy)
else
RCT_URL=https://raw.githubusercontent.com/llvm/llvm-project/refs/heads/release/21.x/clang-tools-extra/clang-tidy/tool/run-clang-tidy.py
echo "Downloading run-clang-tidy script from ${RCT_URL}"
echo "import urllib.request; url = '${RCT_URL}'.rstrip('/'); urllib.request.urlretrieve(url, url.split('/')[-1])" | uv run --no-project --script -
CLANG_TIDY=(uv run --no-project --script -- run-clang-tidy.py)
fi
# Download run-clang-tidy script
RCT_URL=https://raw.githubusercontent.com/llvm/llvm-project/refs/heads/release/21.x/clang-tools-extra/clang-tidy/tool/run-clang-tidy.py
echo "Downloading run-clang-tidy script from ${RCT_URL}"
echo "import urllib.request; url = '${RCT_URL}'.rstrip('/'); urllib.request.urlretrieve(url, url.split('/')[-1])" | uv run --no-project --script -
RUN_CLANG_TIDY=(uv run --no-project --script -- run-clang-tidy.py)

if [[ -x "$(command -v clang-apply-replacements)" ]]; then
echo "Using clang-apply-replacements from $(command -v clang-apply-replacements)"
CLANG_TIDY+=(-fix -clang-apply-replacements-binary="$(command -v clang-apply-replacements)")
RUN_CLANG_TIDY+=(-fix -clang-apply-replacements-binary="$(command -v clang-apply-replacements)")
else
echo "::warning::clang-apply-replacements not found in PATH, automatic fixing disabled."
fi

# Run cmake to create the build directory with compile_commands.json
cmake -S . -B cmake-build --fresh ${CLANG_TIDY_CMAKE_OPTIONS} # no quotes here
echo "::group::compile_commands.json"
ls -alh cmake-build/compile_commands.json
uv run --no-project -m -- json.tool --no-ensure-ascii cmake-build/compile_commands.json
echo "::endgroup::"

CXX_FILES=$(find src -type f -iname "*.[ch]pp" -o -iname "*.cc" -o -iname "*.c" -o -iname "*.h")
rc=0
"${CLANG_TIDY[@]}" -clang-tidy-binary="$(command -v clang-tidy)" \
echo "::group::run-clang-tidy"
"${RUN_CLANG_TIDY[@]}" -clang-tidy-binary="$(command -v clang-tidy)" \
-exclude-header-filter='^(3rdparty|tvm)/.*$' \
-p="cmake-build" ${CXX_FILES} || rc="$?"
echo "::endgroup::"
rm -rf cmake-build run-clang-tidy.py
if (( rc != 0 )); then
echo "::error::clang-tidy found issues (exit code: ${rc}). Please run 'clang-tidy --fix' locally to fix them."
git diff --color=always || true
exit "${rc}"
fi

- name: Enable core dump generation (Linux / GitHub-hosted runners)
if: ${{ runner.os == 'Linux' && !startsWith(matrix.runner.name, 'self-hosted') }}
run: |
sudo sysctl -w kernel.core_pattern="core.${{ matrix.python-version }}.${{ matrix.runner.toolkit }}.%P"
sudo sysctl -w kernel.core_uses_pid=0
sudo sysctl -w fs.suid_dumpable=1
sysctl kernel.core_pattern kernel.core_uses_pid fs.suid_dumpable

- name: Enable core dump generation (macOS / GitHub-hosted runners)
if: ${{ runner.os == 'macOS' && !startsWith(matrix.runner.name, 'self-hosted') }}
run: |
sudo sysctl -w kern.corefile="core.${{ matrix.python-version }}.${{ matrix.runner.toolkit }}.%P"
sudo sysctl -w kern.coredump=1
sudo sysctl -w kern.sugid_coredump=1
sysctl kern.corefile kern.coredump kern.sugid_coredump

- name: Install project (wheel form)
run: |
uv pip install -v .

- name: Run examples with Python ${{ matrix.python-version }} (${{ matrix.runner.toolkit }})
if: contains(matrix.runner.toolkit, 'CUDA')
run: |
@@ -369,6 +367,7 @@ jobs:
./python

# AMD ROCm tests
# runtime and transform tests needs to repair, then rm it from ignore list
- name: Run ROCm tests with Python ${{ matrix.python-version }} (${{ matrix.runner.toolkit }})
id: rocm-tests
if: contains(matrix.runner.toolkit, 'ROCm')
@@ -379,7 +378,8 @@ jobs:
pytest --verbose --color=yes --durations=0 --showlocals --cache-clear
)
"${PYTEST[@]}" --maxfail=3 --numprocesses=4 \
./python/amd/test_tilelang_test_amd.py
--ignore=./python/runtime --ignore=./python/transform \
./python

# Apple Metal tests
- name: Run Metal tests with Python ${{ matrix.python-version }} (${{ matrix.runner.toolkit }})
98 changes: 79 additions & 19 deletions .github/workflows/dist.yml
@@ -1,5 +1,6 @@
name: Dist
on:
workflow_dispatch:
schedule:
# gemini said this is 6:00 china time
- cron: "0 22 * * *"
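The comment above claims that `0 22 * * *` corresponds to 6:00 China time. GitHub Actions evaluates `schedule` cron expressions in UTC, and China Standard Time is UTC+8 with no daylight saving, so the claim can be checked with a short conversion. The function name `cron_utc_to_china` and the reference date are arbitrary choices for this sketch.

```python
from datetime import datetime, timedelta, timezone

# China Standard Time is a fixed UTC+8 offset.
CHINA_TZ = timezone(timedelta(hours=8))


def cron_utc_to_china(hour_utc: int, minute_utc: int = 0) -> datetime:
    """Convert a UTC cron firing time to China time (hypothetical helper).

    The date (2026-01-01) is an arbitrary reference; only the time of day
    and the day rollover matter here.
    """
    fire_utc = datetime(2026, 1, 1, hour_utc, minute_utc, tzinfo=timezone.utc)
    return fire_utc.astimezone(CHINA_TZ)


local = cron_utc_to_china(22)
rollover = "next day" if local.day != 1 else "same day"
print(f"22:00 UTC is {local:%H:%M} in China ({rollover})")
```

The job therefore fires at 06:00 China time on the calendar day after the UTC date of the trigger.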
@@ -17,6 +18,9 @@ on:
- CMakeLists.txt
- version_provider.py
- .github/workflows/dist.yml
# temporarily add to dist check
# until we have type checking in ci / move to python 3.10
- tilelang/_typing.py
release:
types:
- published
@@ -34,6 +38,11 @@ env:
COLUMNS: "100"
FORCE_COLOR: "1"
CLICOLOR_FORCE: "1"
UV_INDEX_STRATEGY: "unsafe-best-match"
UV_HTTP_TIMEOUT: "600"
XDG_CACHE_HOME: "${{ github.workspace }}/.cache" # to be updated
PIP_CACHE_DIR: "${{ github.workspace }}/.cache/pip" # to be updated
UV_CACHE_DIR: "${{ github.workspace }}/.cache/uv" # to be updated

jobs:
build-sdist:
@@ -52,7 +61,7 @@ jobs:

steps:
- name: Checkout repository
uses: actions/checkout@v5
uses: actions/checkout@v6
with:
fetch-depth: 1
submodules: recursive
@@ -71,6 +80,7 @@ jobs:
- name: Setup ccache
uses: hendrikmuhs/ccache-action@v1
with:
max-size: "200MB"
create-symlink: true
evict-old-files: "7d"
append-timestamp: false
@@ -91,7 +101,7 @@ jobs:
- name: Upload SDist
# Skipped on PRs to save artifact storage, as the SDist is only needed for releases.
if: github.event_name != 'pull_request' || contains(github.event.pull_request.title, '[Release]')
uses: actions/upload-artifact@v5
uses: actions/upload-artifact@v6
with:
name: sdist
path: dist/*.tar.gz
@@ -105,41 +115,40 @@ jobs:
strategy:
matrix:
target:
- { runner: ubuntu-latest, toolkit: "CUDA-12.1" }
- { runner: ubuntu-latest, toolkit: "CUDA-12.8" }
- { runner: ubuntu-24.04-arm, toolkit: "CUDA-12.8" }
- { runner: ubuntu-latest, toolkit: "Nightly-CUDA-13.0" }
- { runner: ubuntu-24.04-arm, toolkit: "Nightly-CUDA-13.0" }
- { runner: macos-latest, toolkit: "Metal" }
python-version:
# Wheels are built with Python 3.8 Limited API, they should work with all Python >= 3.8.
# Only build wheels against Python 3.8 Limited API to save CI resources.
# FIXME: Here we use Python 3.9 because our dependency `apache-tvm-ffi` claims to support
# Python 3.8 but it depends on a version of `ml-dtypes` that requires Python >= 3.9.
- "3.9"
fail-fast: false
timeout-minutes: 120
runs-on: ${{ matrix.target.runner }}
env:
NO_VERSION_LABEL: ${{ github.event_name == 'release' && 'OFF' || 'ON' }}
IS_RELEASE: ${{ github.event_name != 'pull_request' || contains(github.event.pull_request.title, '[Release]') }}
NO_VERSION_LABEL: "OFF"

steps:
- name: Checkout repository
uses: actions/checkout@v5
uses: actions/checkout@v6
with:
fetch-depth: 1
submodules: recursive

- name: Setup ccache
uses: hendrikmuhs/ccache-action@v1
with:
max-size: "200MB"
create-symlink: true
evict-old-files: "7d"
append-timestamp: false
key: wheel-${{ runner.os }}-${{ runner.arch }}-${{ matrix.target.toolkit }}-${{ hashFiles('**/*.cc') }}
key: wheel-${{ runner.os }}-${{ runner.arch }}-${{ hashFiles('**/*.cc') }}
restore-keys: |
wheel-${{ runner.os }}-${{ runner.arch }}-${{ matrix.target.toolkit }}-${{ hashFiles('**/*.cc') }}
wheel-${{ runner.os }}-${{ runner.arch }}-${{ matrix.target.toolkit }}
wheel-${{ runner.os }}-${{ runner.arch }}-${{ hashFiles('**/*.cc') }}
wheel-${{ runner.os }}-${{ runner.arch }}
${{ runner.os }}-${{ runner.arch }}-${{ matrix.target.toolkit }}
${{ runner.os }}-${{ runner.arch }}

- name: Set CIBW_BUILD
run: |
@@ -150,26 +159,77 @@ jobs:

if [[ "${{ matrix.target.toolkit }}" == *"CUDA"* ]]; then
CUDA_VERSION="${{ matrix.target.toolkit }}"
CUDA_VERSION="${CUDA_VERSION#CUDA-}"
CUDA_VERSION="${CUDA_VERSION##*-}"
CUDA_VERSION_MAJMIN="$(echo ${CUDA_VERSION} | cut -d '.' -f-2)"
CUDA_VERSION_MAJMIN_NODOT="${CUDA_VERSION_MAJMIN//./}"
echo "CUDA_VERSION=${CUDA_VERSION}" | tee -a "${GITHUB_ENV}"
if [[ "${{ matrix.target.toolkit }}" == "Nightly-"* ]]; then
# Use torch nightly builds
export UV_INDEX="https://download.pytorch.org/whl/nightly/cu${CUDA_VERSION_MAJMIN_NODOT}"
else
export UV_INDEX="https://download.pytorch.org/whl/cu${CUDA_VERSION_MAJMIN_NODOT}"
echo "UV_TORCH_BACKEND=cu${CUDA_VERSION_MAJMIN_NODOT}" | tee -a "${GITHUB_ENV}"
fi
echo "UV_INDEX=${UV_INDEX}" | tee -a "${GITHUB_ENV}"
fi

if [[ "${{ env.IS_RELEASE }}" == "true" ]]; then
if [[ "${{ matrix.target.toolkit }}" == "Nightly-"* ]]; then
# Avoid using same file name for different toolkit.
echo "NO_GIT_VERSION=ON" | tee -a "${GITHUB_ENV}"
else
echo "NO_VERSION_LABEL=ON" | tee -a "${GITHUB_ENV}"
fi
fi

if [[ "${{ runner.os }}" == "Linux" ]]; then
HOST_CCACHE_DIR="$(ccache --get-config cache_dir)"
echo "CIBW_BEFORE_BUILD_LINUX=yum install -y ccache && ccache -o cache_dir=/host${HOST_CCACHE_DIR}" | tee -a "${GITHUB_ENV}"
echo "CIBW_BEFORE_BUILD_LINUX=dnf install -y ccache && ccache -o cache_dir=/host${HOST_CCACHE_DIR}" | tee -a "${GITHUB_ENV}"
fi

- name: Build wheels
uses: pypa/cibuildwheel@v3.2
uses: pypa/cibuildwheel@v3.3
with:
package-dir: .
output-dir: wheelhouse
config-file: "{package}/pyproject.toml"

- name: Setup Python and uv with caching
id: setup-uv
uses: astral-sh/setup-uv@v7
with:
python-version: "3.12"
activate-environment: true

- name: Test built wheels
run: |
for WHEEL in wheelhouse/*.whl; do
echo "Testing wheel: ${WHEEL}"
(
set -e
uv venv --python=3.12 test-venv
source test-venv/bin/activate

uv pip install --upgrade pip setuptools wheel
if [[ "${UV_INDEX}" == *"/nightly/"* ]]; then
uv pip install --prerelease=allow -v torch
fi

uv pip install -v "${WHEEL}"
(
set -e
cd /
uv run --no-project -- python -c "import tilelang; print(tilelang.__version__)"
)
deactivate
rm -rf test-venv
)
done

- name: Upload wheels
# Skipped on PRs to save artifact storage, as wheels are only needed for releases.
if: github.event_name != 'pull_request' || contains(github.event.pull_request.title, '[Release]')
uses: actions/upload-artifact@v5
uses: actions/upload-artifact@v6
with:
name: wheels-${{ matrix.python-version }}-${{ runner.os }}-${{ runner.arch }}-${{ matrix.target.toolkit }}
path: wheelhouse/*.whl
@@ -184,15 +244,15 @@ jobs:
timeout-minutes: 15
steps:
- name: Download built SDist
uses: actions/download-artifact@v6
uses: actions/download-artifact@v7
with:
# unpacks default artifact into dist/
# if `name: artifact` is omitted, the action will create extra parent dir
name: sdist
path: dist

- name: Download built wheels
uses: actions/download-artifact@v6
uses: actions/download-artifact@v7
with:
pattern: wheels-*
path: dist
@@ -202,7 +262,7 @@ jobs:
run: ls -lh dist/*

- name: Upload artifacts
uses: actions/upload-artifact@v5
uses: actions/upload-artifact@v6
with:
name: artifacts
path: dist/*
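
Note on the `Set CIBW_BUILD` hunk above: the change from `${CUDA_VERSION#CUDA-}` to `${CUDA_VERSION##*-}` is what lets labels such as `Nightly-CUDA-13.0` be parsed, since the old expansion only strips a leading `CUDA-` and leaves a `Nightly-` label untouched. The following is a minimal standalone sketch of that parsing, not code from the workflow. The function name `torch_index_for_toolkit` is hypothetical, and `tr -d '.'` stands in for the workflow's `${VAR//./}` substitution so that the sketch also runs under a plain POSIX shell.

```shell
# Hypothetical helper (not part of the workflow): map a toolkit label such as
# "CUDA-12.8" or "Nightly-CUDA-13.0" to the matching PyTorch wheel index URL.
torch_index_for_toolkit() {
  toolkit="$1"

  # Remove the longest prefix ending in "-", leaving only the version.
  # "CUDA-12.8" -> "12.8", "Nightly-CUDA-13.0" -> "13.0"
  version="${toolkit##*-}"

  # Keep major.minor only, then drop the dot: "12.8" -> "128".
  majmin_nodot="$(printf '%s\n' "${version}" | cut -d '.' -f-2 | tr -d '.')"

  # Nightly labels use the nightly index; everything else uses the stable one.
  case "${toolkit}" in
    Nightly-*) printf '%s\n' "https://download.pytorch.org/whl/nightly/cu${majmin_nodot}" ;;
    *)         printf '%s\n' "https://download.pytorch.org/whl/cu${majmin_nodot}" ;;
  esac
}

torch_index_for_toolkit "CUDA-12.8"
torch_index_for_toolkit "Nightly-CUDA-13.0"
```

Calling the helper with each `toolkit` value from the build matrix is a quick way to check locally which index a new matrix entry would resolve to before pushing the workflow change.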