
Conversation

@kurisu6912 (Owner)

senlyu163 and others added 30 commits December 17, 2025 11:39
…ile-ai#1445)

* Remove JIT decorator from elementwise_add function in examples

* fix kernel compilation without autotune

* Refactor main function to accept parameters and update tests for autotune option

* Refactor autotune test function for modern style
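
For context on these autotune changes, a minimal sketch of the pattern the tests converge on. The kernel follows tilelang's usual elementwise style; the exact test code differs, and the compile call shape is an assumption based on the commit titles.

```python
import tilelang
import tilelang.language as T

def elementwise_add(M, block_M=128, dtype="float32"):
    @T.prim_func
    def kernel(A: T.Tensor((M,), dtype), B: T.Tensor((M,), dtype),
               C: T.Tensor((M,), dtype)):
        with T.Kernel(T.ceildiv(M, block_M), threads=block_M) as bx:
            for i in T.Parallel(block_M):
                idx = bx * block_M + i
                if idx < M:  # boundary guard for M not divisible by block_M
                    C[idx] = A[idx] + B[idx]
    return kernel

# Without autotune, one fixed configuration is compiled directly:
jit_kernel = tilelang.compile(elementwise_add(1024), out_idx=[-1])
```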
* [Enhancement] Introduce non-restrict parameter support in code generation

- Added a new PrimFunc-level attribute `tl.non_restrict_params` to specify handle Vars that should not be marked with the restrict qualifier during code generation.
- Updated `CodeGenTileLangCPP`, `CodeGenTileLangCUDA`, and `CodeGenTileLangHIP` to handle non-restrict parameters, ensuring proper treatment of overlapping buffer aliases.
- Implemented a new annotation function `annotate_restrict_buffers` to facilitate the marking of buffer parameters as non-restrict.
- Enhanced the `SplitHostDevice` transformation to propagate non-restrict parameters from host to device functions.
- Added a new transform function `HoistNonRestrictParams` to manage non-restrict parameters effectively.
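
As a rough illustration of the mechanism (not the actual API surface): `with_attr` is standard TVM, the attribute key is the one named above, and the real annotation path goes through `annotate_restrict_buffers`.

```python
def mark_non_restrict(func, handle_vars):
    """Attach `tl.non_restrict_params` to a tvm.tir.PrimFunc.

    `handle_vars` are the buffer-handle Vars that may alias and therefore
    must not be emitted with the __restrict__ qualifier. Sketch only.
    """
    return func.with_attr("tl.non_restrict_params", handle_vars)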

* [Enhancement] Improve HoistNonRestrictParams transformation

- Updated the HoistNonRestrictParams function to recursively collect all `tl.non_restrict_params` annotations from nested blocks, enhancing flexibility in annotation placement.
- Introduced a new NonRestrictCollector class to manage the collection and deduplication of non-restrict parameters.
- Modified the SplitHostDevice transformation to remove the non-restrict attribute from the host-side PrimFunc after propagation to device kernels.
- Adjusted the LowerAndLegalize function to directly apply the HoistNonRestrictParams transformation without exception handling, streamlining the process.

* [Refactor] Simplify non-restrict parameter handling in code generation

- Removed unnecessary normalization logic and associated data structures from `CodeGenTileLangCPP`, `CodeGenTileLangCUDA`, and `CodeGenTileLangHIP`.
- Streamlined the handling of non-restrict parameters by directly inserting them into the `non_restrict` set, improving code clarity and maintainability.
- Updated conditional checks to eliminate redundant checks against normalized names, enhancing performance and readability.

* [Dependency] Update TVM subproject to latest commit 68aa8461

- Updated the TVM subproject to the latest commit, ensuring compatibility with recent changes and improvements.
- Refactored non-restrict parameter handling in `CodeGenTileLangCPP`, `CodeGenTileLangCUDA`, and `CodeGenTileLangHIP` to enhance code clarity and maintainability.
- Adjusted the `SplitHostDevice` transformation to streamline the propagation of non-restrict parameters.

* fix
…nctionality (tile-ai#1448)

* [Enhancement] Update examples and tests for improved type handling and functionality

- Enhanced various example scripts to support new data types and improve compatibility with PyTorch.
- Updated tests across multiple modules to ensure correct functionality with the latest changes in type handling.
- Refactored code in examples to streamline operations and improve clarity, particularly in tensor operations and memory management.
- Added comprehensive tests for new features and fixed existing issues related to type conversions and buffer handling.

* [Refactor] Update accumulation data type to float32 across examples

- Changed accumulation data type from "float" to T.float32 in multiple example scripts to ensure consistency and improve numerical stability.
- This update affects various modules including flash attention, GEMM analysis, convolution, and deepseek MLA examples, enhancing type handling across the board.
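
The pattern being standardized, sketched on a quickstart-style tilelang GEMM (block sizes illustrative): inputs stay in `T.float16` while the fragment accumulates in `T.float32`.

```python
import tilelang.language as T

def matmul(M, N, K, block_M=64, block_N=64, block_K=32,
           dtype=T.float16, accum_dtype=T.float32):  # was accum_dtype="float"
    @T.prim_func
    def main(A: T.Tensor((M, K), dtype), B: T.Tensor((K, N), dtype),
             C: T.Tensor((M, N), dtype)):
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M),
                      threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_K), dtype)
            B_shared = T.alloc_shared((block_K, block_N), dtype)
            C_local = T.alloc_fragment((block_M, block_N), accum_dtype)
            T.clear(C_local)
            for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=2):
                T.copy(A[by * block_M, k * block_K], A_shared)
                T.copy(B[k * block_K, bx * block_N], B_shared)
                T.gemm(A_shared, B_shared, C_local)  # accumulates in float32
            T.copy(C_local, C[by * block_M, bx * block_N])
    return main
```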

* [Refactor] Standardize data type usage across benchmark scripts

- Updated data type definitions in benchmark scripts to use T.float16 and T.float32 consistently, enhancing clarity and type handling.
- Adjusted dtype assignments in matmul functions and configuration setups to align with the new standard.
- Improved overall code consistency and maintainability by ensuring uniform data type usage across various modules.

* [Refactor] Standardize data type usage in templates and scripts

- Updated data type definitions in various templates and scripts to use string representations (e.g., "float16", "int32") instead of T.float16 and T.int32 for improved consistency and clarity.
- Enhanced overall code maintainability by ensuring uniform data type usage across multiple modules, including convolution, elementwise operations, and matrix multiplication templates.
- This change aims to streamline type handling and improve compatibility with existing workflows.

* [Refactor] Standardize data type usage in examples and benchmarks

- Updated data type definitions in various example and benchmark scripts to use T.float16 and T.int32 consistently, enhancing clarity and maintainability.
- Adjusted dtype assignments in kernel functions and configuration setups to align with the new standard.
- Improved overall code consistency by ensuring uniform data type usage across multiple modules, including attention mechanisms, matrix multiplication, and GEMM examples.

* [Refactor] Import dtypes from language.v2 module

- Added import statement for dtypes from the language.v2 module to enhance type handling and maintain consistency across the codebase.
- This change aims to streamline data type management and improve overall code clarity.

* fix

* [Refactor] Standardize data type usage across scripts

- Updated data type definitions in various scripts to use string representations (e.g., "float16", "int8") instead of T.float16 and T.int8 for improved consistency and clarity.
- Adjusted dtype assignments in functions and configuration setups to align with the new standard, enhancing overall code maintainability.
- This change affects multiple modules, including benchmark and attention mechanisms, ensuring uniform data type usage throughout the codebase.

* [Refactor] Update data type handling for consistency and clarity

- Changed string representations of data types in the Hint class to use T.float32 and T.int32 for improved consistency.
- Added new data types "int4" and "int16" to the dtypes module, enhancing type support across the codebase.
- Updated function signatures and assertions in the lop3 and mxfp modules to utilize the new data types, ensuring uniformity in type handling.
- This refactor aims to streamline data type management and improve overall code clarity and maintainability.

* [Enhancement] Improve data type handling and error messaging

- Introduced a mapping for canonical data types to their display strings, enhancing clarity in type representation.
- Updated the dtype creation logic to utilize the new mapping, ensuring more intuitive handling of string inputs.
- Refined error messages in the lop3 module to provide clearer feedback on invalid source formats, improving debugging and user experience.
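
A plain-Python sketch of the mapping idea (entries illustrative, not the exact table from the commit): canonical dtype names keep their precise spelling, while user-facing strings accept friendly aliases.

```python
_CANONICAL = {"float32", "float16", "int32", "int16", "int4", "float8_e4m3fn"}
_ALIASES = {"float": "float32", "half": "float16", "int": "int32"}

def canonicalize(name: str) -> str:
    # resolve display aliases first, then validate against the canonical set
    name = _ALIASES.get(name, name)
    if name not in _CANONICAL:
        raise ValueError(f"unknown dtype string: {name!r}")
    return name
```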

* [Fix] Correct boolean flag in GEMM SP test case

- Updated the boolean flag in the test_gemm_sp_sm90 function to ensure proper functionality in the test case.
- This change enhances the accuracy of the test and aligns it with expected behavior for the GEMM SP implementation.

* [Refactor] Standardize data type usage across scripts

- Updated data type definitions in various scripts to use T.float16 and T.bfloat16 consistently, enhancing clarity and maintainability.
- Adjusted dtype assignments in function signatures and argument parsing to align with the new standard, ensuring uniform data type usage throughout the codebase.
- This change affects multiple modules, including benchmarks and examples, improving overall code consistency and readability.

* [Refactor] Standardize data type usage in various modules

- Updated data type assignments in multiple scripts to utilize T.float32, T.int8, and T.int32 consistently, enhancing clarity and maintainability.
- Adjusted function signatures and parameter types across benchmarks, examples, and tests to align with the new standard, ensuring uniform data type usage throughout the codebase.
- This change improves overall code consistency and readability, impacting modules related to matrix multiplication, GEMM, and tensor operations.

* [Refactor] Update argument parsing for data types in benchmarks

- Changed argument parsing for data types in benchmark_matmul_intrinsic.py and benchmark_matmul_sp.py to use string representations ("float16", "int8", "float") instead of T.float16 and T.float.
- This update enhances consistency in data type handling across benchmark scripts, improving clarity and maintainability.

* [Refactor] Update data type handling in benchmark and example scripts

- Changed data type arguments in benchmark and example scripts to use string representations ("float16") instead of T.float16 for improved consistency.
- Updated function signatures and argument parsing to align with the new standard, enhancing clarity and maintainability across the codebase.
- This change affects multiple modules related to attention mechanisms and tensor operations, ensuring uniform data type usage throughout the examples.

* [Refactor] Fix data type conversion in multiple scripts

- Corrected the usage of the data type conversion method from dtype..as_torch() to dtype.as_torch() across various benchmark and example scripts.
- This change enhances consistency in data type handling and improves code readability, impacting modules related to attention mechanisms and tensor operations.
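
The corrected call shape, for reference (one dot, not two):

```python
import torch
import tilelang.language as T

# dtype objects expose .as_torch(); the double-dot form was a typo.
assert T.float16.as_torch() is torch.float16
```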

* [Refactor] Update float8 data type usage across multiple scripts

- Changed instances of T.float8_e4m3 to T.float8_e4m3fn in various benchmark, example, and test scripts to ensure consistency in data type handling.
- This update enhances clarity and maintainability across the codebase, particularly in modules related to matrix multiplication and tensor operations.

* [Refactor] Enhance float8 data type handling in CUDA code generation

- Updated the handling of float8 data types in the CUDA code generation to include additional float8 variants, improving type conversion logic.
- Adjusted conditions to ensure proper type checks for float8 conversions, enhancing clarity and maintainability in the codebase.
- Modified layout inference to streamline float8 type checks, ensuring consistency across the implementation.
- This change impacts modules related to matrix operations and CUDA code generation, improving overall type handling and conversion accuracy.

* [Refactor] Streamline float8 data type handling in CUDA and related modules

- Enhanced float8 data type handling in CUDA code generation by refining type conversion logic and ensuring consistent type checks.
- Updated layout inference for float8 types to improve clarity and maintainability across the implementation.
- This change impacts modules related to matrix operations and CUDA code generation, improving overall type handling and conversion accuracy.

* [Refactor] Remove unnecessary cache disabling in float8 example script

- Eliminated the call to tilelang.disable_cache() in example_group_per_split_token_cast_to_fp8.py to streamline the code.
- This change enhances clarity and maintainability of the example script without affecting its functionality.

* [Refactor] Update data type usage in debug print tests

- Changed the argument for dtype in the test_debug_print_buffer function from a string representation to the corresponding T.bool type.
- This update enhances consistency in data type handling within the test suite, improving clarity and maintainability.

* lint fix

* Update function parameter types from `str` to `T.dtype` for improved type safety in attention sink and related examples

* Refactor `gemv_alloc_reducer` function signature for improved readability by formatting parameters across multiple lines.
* fix floordiv & floormod in z3 prover

* fix lint error
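
Background on why floordiv/floormod need care in a Z3 encoding: SMT-LIB Int `div`/`mod` are Euclidean, which agrees with TIR floordiv/floormod only for positive divisors. A small general z3py check (not the prover's actual code):

```python
import z3

# For a positive divisor the Euclidean remainder always lands in [0, d),
# matching TIR floormod; the solver finds no counterexample.
a = z3.Int("a")
s = z3.Solver()
s.add(z3.Not(z3.And(a % 4 >= 0, a % 4 < 4)))
print(s.check())  # unsat: the bound holds for divisor 4 > 0
```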
* Enhance cache directory structure by including version information in sparse.py to ensure separate caches for different versions.

* Fix formatting in sparse.py by adding a newline for improved readability and consistency.
…ernel (tile-ai#1461)

* add curand.{curand_init, curand}

* run format.sh

* add default value for curand_init & add test for curand

* Update testing/python/language/test_rand.py

Remove unused thread binding

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* remove unused library

* enable tilelang cache for testing

* run format.sh

* Revert "run format.sh"

This reverts commit 5afaff7.

* Revert "enable tilelang cache for testing"

This reverts commit c277a43.

* Revert "remove unused library"

This reverts commit 568ad20.

* run format.sh

* ensure FreshName for __philox_state

* ensure FreshName for __philox_state

* change the return type of T.rng_init
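
A sketch of the new RNG intrinsics from this PR; `T.rng_init` and `T.rng_rand` are the names used in these commits, but the signatures below are assumptions.

```python
import tilelang.language as T

@T.prim_func
def fill_random(out: T.Tensor((1024,), "float32")):
    with T.Kernel(8, threads=128) as bx:
        tx = T.get_thread_binding()
        state = T.rng_init(42)   # per-thread Philox state; return type changed above
        out[bx * 128 + tx] = T.rng_rand(state)  # draw one uniform sample
```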

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>
* feat: CuTeDSL backend

* fix: clang-tidy

* fix: clang-format

* fix: ci

* fix: revert example gemm fp8

* fix: remove duplicate code

* fix: switch-case

* fix: fp16 silence

* fix: TVM IR print

* fix: useless tir

* fix: clang-format

* fix: remove tilelang/contrib/cutedsl/.gitignore

* fix: use hexfloat

* fix: gsym guard

* fix: unknown storage sync type

* fix: string literal

* fix: add args guard

* fix: name hint dedup

* fix: better find_kernel_by_pattern

* fix: set libpath for from_database path

* fix: guard buffer.strides

* fix: from guard

* fix: eviction guard

* fix: use thread local tma descs

* fix: ruff

* fix: drop tma_init_cpp

* fix: exc_info

* fix: negative unmatch early return

* fix: rename postproc func and add test

* fix: handle fast math according to pass config

* fix: dyn_sym parse

* fix: wrap_forward

* fix: use tvm_ffi.libinfo instead of cli

* fix: keep signature

* fix: C++ string safety

* fix: mark tma_store_add as unsupported

* fix: tvm version

* resolve ldsm and cpasync issues.

* fix: minor fixes

* fix: parse signature using ast

* fix: guard global_addr

* fix: create tempfile only when necessary

* fix: use logger.exception for exceptions

* fix: guard lib_path and host_func

* fix: remove tma_cpp_init and add timeout for cpp compile

* add timeout for mbarrier_wait.

* fix: _load_kernel_from_disk signature

* resolve codegen issues.

* fix: logger.exception

* add comment for div_by=1

* merge

* fix: reserve cutlass,cute,tl

* fix: guard tma_store

* fix: allow int64 offset in make_tensor_at_offset

* fix: guard barrier

* fix: add comments for div_by=16

* fix: div_by=1 issue

* delete div_by when offset is 0

* use tl.make_tensor when offset is 0

* fix: explicitly check cutedsl target

* fix: use param.torch_dtype()
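
How the new backend would be selected, as a sketch: the target string `"cutedsl"` is inferred from "fix: explicitly check cutedsl target" above, so treat the exact spelling as an assumption.

```python
import tilelang

def compile_for_cutedsl(func):
    # func is a tilelang PrimFunc; target string assumed from the commits above
    return tilelang.compile(func, target="cutedsl")
```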

---------

Co-authored-by: yuxic <yuxic@nvidia.com>
Co-authored-by: Yong <yong@local>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
…lang_language_rand.py` (tile-ai#1464)

* rename test for curand & add triton baseline

* add a comment for calling T.rng_rand() four times

* refactor tilelang&triton kernel

* Add boundary checks for M not divisible by 128
)

* feat(arg_binder): enhance shape variable handling and assertions

- Implemented special handling for comparing if_then_else expressions to simplify conditions involving NULL checks.
- Added methods to set shared shape variables and finalize deferred bindings, generating cascading if_then_else expressions and runtime assertions for non-NULL buffers.
- Updated the binding logic to defer shape variable bindings for shared variables, ensuring proper handling across multiple nullable buffers.

* refactor(arg_binder): clean up shape variable handling and remove unused code

- Removed deprecated methods for setting shared shape variables and finalizing deferred bindings, streamlining the argument binding process.
- Simplified the logic for handling shape values in the `BindDLTensor` function, ensuring immediate binding for normal shape variables.
- Enhanced clarity by eliminating unnecessary comments and code related to cascading if_then_else expressions for shared variables.

* refactor(arg_binder): enhance DLTensor binding with improved shape handling

- Replaced the single `BindDLTensor` method with `BindDLTensors` to support multiple buffers, improving flexibility in handling DLTensor bindings.
- Introduced a two-pass approach for shape variable handling, allowing for better management of symbolic dimensions and null checks.
- Updated the logic to assert non-null conditions at runtime and utilize cascaded if_then_else expressions for shape retrieval, enhancing robustness.
- Removed deprecated code and streamlined the binding process for clarity and maintainability.

* fix(test_nullable_buffer_params): improve formatting and consistency in test output

- Updated string formatting for better readability in the `test_nullable_shared_shape` function.
- Ensured consistent use of double quotes for string literals.
- Added a missing newline at the end of the file for proper formatting.

* refactor(arg_binder): simplify allocation size calculation in BindDLTensors

- Streamlined the calculation of allocation size by replacing a lambda function with a direct loop, enhancing readability and maintainability.
- Improved clarity in the null check message for data pointers, ensuring better understanding of the binding process.
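
The two-pass idea in plain Python, stripped of TIR details (names hypothetical): take the symbolic extent from the first non-NULL buffer, then runtime-assert the others agree.

```python
def bind_shared_dim(tensors, axis):
    # pass 1: gather the axis extent from every non-NULL tensor
    candidates = [t.shape[axis] for t in tensors if t is not None]
    assert candidates, "all nullable buffers are NULL; shared dim is unbound"
    # pass 2: bind to the first candidate (the cascaded if_then_else),
    # and assert the remaining non-NULL buffers match it
    extent = candidates[0]
    assert all(c == extent for c in candidates[1:]), "shape mismatch"
    return extent
```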

* Remove debug prints from phase.py

Removed debug print statements after MakePackedAPI transformation.
…-ai#1466)

* [Language] Make TL scripts friendly to Python syntax highlighting

* add comments

* fix submodule
… into examples (tile-ai#1470)

* remove triton dependence in testing & move triton baseline into example

* use ceildiv and handles arbitrary M correctly for triton
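
The standard shape this baseline ends up with, as a sketch in plain Triton: a ceildiv grid so the tail block is launched, plus a mask so it stays in bounds.

```python
import triton
import triton.language as tl

@triton.jit
def copy_kernel(x_ptr, y_ptr, M, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < M                      # boundary check when M % BLOCK != 0
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(y_ptr + offs, x, mask=mask)

def launch(x, y, M, BLOCK=128):
    grid = (triton.cdiv(M, BLOCK),)      # ceildiv: include the partial tail block
    copy_kernel[grid](x, y, M, BLOCK=BLOCK)
```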
…e-ai#1473)

* [Language] Enhance dtype conversion for PyTorch compatibility

- Added support for new float8 and float4 data types in the __dtype_as_torch__ method.
- Implemented backend-specific handling for float8_e4m3 based on HIP or CUDA.
- Included assertions to ensure compatibility with the required PyTorch versions for each dtype.
- Improved error handling for unsupported dtypes.
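
A hypothetical helper illustrating the version-gated mapping; the real method is `__dtype_as_torch__`, with HIP/CUDA-specific handling for `float8_e4m3` that this sketch omits.

```python
import torch

def as_torch_fp8(name: str) -> torch.dtype:
    # guard against older PyTorch builds that lack the fp8/fp4 attributes
    assert hasattr(torch, name), f"{name} requires a newer PyTorch version"
    return getattr(torch, name)

print(as_torch_fp8("float8_e4m3fn"))  # torch.float8_e4m3fn on PyTorch >= 2.1
```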

* Fix test script execution and improve error messages for dtype assertions

- Commented out the main execution call in the test script and replaced it with a direct call to the test function `test_divmod()`.
- Enhanced error messages in the dtype conversion assertions to improve clarity and readability, ensuring proper guidance for required PyTorch versions.
* Update README.md with latest news, including CuTeDSL backend support, Z3 theorem prover integration, and migration to apache-tvm-ffi for improved compatibility.

* Update README.md to enhance CuTeDSL backend announcement with a link to related issue and clarify migration benefits to apache-tvm-ffi, reducing CPU overhead.
* use static Z3 context

* Update submodule reference for TVM to indicate a dirty state
…rp specialized pass (tile-ai#1484)

* [Feature] Add FullyReplicated Fragment Layout and Enhance Layout Inference

* Introduced a new static method `FullyReplicated` in the `Fragment` class to create fully replicated fragment layouts, ensuring all threads hold identical copies of the buffer.
* Updated `CopyNode` to collect fragment layouts and mark them as fully replicated during layout inference.
* Enhanced `ParallelOpNode` to expand let bindings for fragment buffer accesses, improving layout inference accuracy.
* Added documentation for new methods and updated existing methods to support the new layout features.

* lint fix

* Remove debug logging statements from layout inference process to streamline output and improve performance.
…icAlignment` as they are legacy (tile-ai#1486)

* [Cleanup] Remove dynamic shape example and related tests

* Deleted the dynamic shape example script `example_dynamic.py` and its corresponding test file `test_example_dynamic.py` to streamline the codebase.
* Removed unused dynamic tail split and dynamic alignment configurations from `builtin.h` and `pass_config.py`.
* Cleaned up the dynamic shape testing files to eliminate redundancy and improve maintainability.

* build fix
…Evaluator (tile-ai#1491)

* [Cleanup] Remove dynamic shape example and related tests

* Deleted the dynamic shape example script `example_dynamic.py` and its corresponding test file `test_example_dynamic.py` to streamline the codebase.
* Removed unused dynamic tail split and dynamic alignment configurations from `builtin.h` and `pass_config.py`.
* Cleaned up the dynamic shape testing files to eliminate redundancy and improve maintainability.

* build fix

* Update submodule reference for TVM to latest commit 315036dc

* phase out z3
* [Feature]: Add benchmark scripts for examples

* apply cupti

* fix

* format

* initial commit

* fix

* upd

* upd

* lint

* fix

* fake

* Simplify PR regression test workflow

Removed redundant 'Clean pip environment' steps from the workflow.

* Update test_perf_regression.py

* Enhance regression test bot workflow file handling

Updated the GitHub Actions workflow to improve file handling for the regression test report.

* Update regression test workflow for artifact naming

* Update pr-regression-test-bot.yml

* fix

* lint

* Update performance regression test trigger conditions

---------

Co-authored-by: yyttt6 <1652272478@qq.com>
Updated concurrency group to use issue/PR number.
…troduce processing for floating fragment buffers (tile-ai#1495)

* [Refactor] Replace local allocations with variable allocations in various examples and operations

* Updated multiple files to replace local buffer allocations with variable allocations for improved performance and clarity.
* Changed `alloc_local` to `alloc_var` in examples related to attention mechanisms, deep learning models, and GEMM operations.
* Enhanced code readability and maintainability by streamlining buffer management across different components.
* Ensured consistent handling of buffer scopes and types throughout the codebase.

* typo fix

* test fix

* [Refactor] Simplify index handling in sparse MLA forward pipelined example

* Updated index handling in `sparse_mla_fwd_pipelined.py` to eliminate unnecessary local array usage, improving code clarity and performance.
* Replaced instances of `indices_local[0]` with direct usage of `indices_local` for better readability and consistency in buffer access.
* Commented out the main execution call in the GDN test script to focus on the specific test function, enhancing test clarity.

* lint fix
)

* [Enhancement] Optimize MHA varlen fwd and support autotune

* use fa2 instead of fa3 as baseline in ci
…upported FP8 type (tile-ai#1474)

* Refactor CUDA vectorized cast generation and remove unsupported FP8 type

* test fix

* lint fix

* Refactor CUDA vectorized cast function naming for clarity

* Add support for float4_e2m1fn type conversions in CUDA vectorized casts

- Implemented conversions between float4_e2m1fn and float32, half2, and float2 in utils.cc and cuda_fp4.h.
- Updated test_tilelang_language_vectorized_cast.py to validate new conversions and ensure correctness.
- Enhanced dtype conversion in dtypes.py to handle float4_e2m1fn appropriately, logging a warning for unsupported types in PyTorch.

* Enhance vectorized cast tests for new data types

- Added tests for vectorized casting of float8 and float4 data types, ensuring compatibility with CUDA compute versions.
- Refactored existing test functions to improve clarity and organization, separating tests for different data types.
- Updated parameterization to include additional test cases for new conversions.

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
SiriusNEO and others added 29 commits February 2, 2026 18:47
* [Refactor] Unify the usage of cast-related operators

* reinterpret auto detect
…IT compilations (tile-ai#1776)

Refactor pass_configs initialization in JITKernel to ensure a new dictionary is created if pass_configs is not None. This change improves clarity and prevents potential issues with mutable default arguments.
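
The pattern described above, in isolation: copy rather than alias, and never mutate a shared default.

```python
def normalize_pass_configs(pass_configs=None):
    # fresh dict per call; avoids mutating caller state or a mutable default
    return dict(pass_configs) if pass_configs is not None else {}
```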
* [CI] [pre-commit.ci] autoupdate

updates:
- [github.com/astral-sh/ruff-pre-commit: v0.14.11 → v0.14.14](astral-sh/ruff-pre-commit@v0.14.11...v0.14.14)
- [github.com/jackdewinter/pymarkdown: v0.9.34 → v0.9.35](jackdewinter/pymarkdown@v0.9.34...v0.9.35)

* sync requirements-lint.txt

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: SiriusNEO <chaofan@deepseek.com>
…ontend (tile-ai#1777)

* temp

* temp

* [Refactor] Improve type annotations and reduce some lint errors

* some fixes

* update

* update

* address comments

* address comments

* fix print

* address comments

* refactor typing to _typing

* fix more

* fix reduce

* no return

* fix

* fix cumsum
Update TVM submodule: fix select/if_then_else OOB access

Update TVM to include fix for out-of-bounds memory access when
if_then_else is nested inside select during code generation.

See: tile-ai/tvm#26

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
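
A hedged reconstruction of the shape the TVM fix targets (names and sizes illustrative): `Select` may evaluate both arms eagerly, so the bounds-guarded load must stay inside the short-circuiting `if_then_else`.

```python
import tilelang.language as T

@T.prim_func
def gather(A: T.Tensor((128,), "float32"), B: T.Tensor((128,), "float32"),
           C: T.Tensor((128,), "float32"), n: T.int32):
    for i in T.serial(128):
        # if_then_else nested inside Select: the pattern that previously
        # produced an out-of-bounds access during codegen
        C[i] = T.Select(i % 2 == 0,
                        T.if_then_else(i < n, A[i], T.float32(0)),
                        B[i])
```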
…ile-ai#1772)

* [Feature] Add fully replicated layout interface in annotation layout

* Lint

* Remove test for issue 1729 from the tilelang testing suite

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
…or (tile-ai#1784)

[Example][BugFix] Fix argument override in deepseek_v32 topk_selector example.
…ile-ai#1778)

* Fix type annotations for T.reshape and T.view

* Fix issue tile-ai#1666: reduce_sum with clear=False not accumulating correctly

* address comments and add testcases

* add more tests
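
The semantics fixed by tile-ai#1666, sketched below; the `dim`/`clear` keywords follow the issue text, the rest of the call shape is assumed.

```python
import tilelang.language as T

@T.prim_func
def rowsum(A: T.Tensor((64, 64), "float32"), B: T.Tensor((64, 64), "float32"),
           out: T.Tensor((64,), "float32")):
    with T.Kernel(1, threads=128):
        frag = T.alloc_fragment((64, 64), "float32")
        acc = T.alloc_fragment((64,), "float32")
        T.copy(A, frag)
        T.reduce_sum(frag, acc, dim=1, clear=True)   # acc = sum(A, dim=1)
        T.copy(B, frag)
        T.reduce_sum(frag, acc, dim=1, clear=False)  # acc += sum(B, dim=1): the fixed path
        T.copy(acc, out)
```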

---------

Co-authored-by: SiriusNEO <chaofan@deepseek.com>
* fix

* simplify the constraint

* fix docs

---------

Co-authored-by: SiriusNEO <chaofan@deepseek.com>
…ite-after-read (tile-ai#1781)

* Fix thread storage synchronization logic in `thread_storage_sync.cc` to correctly identify conflicts between read and write operations based on loop carry conditions.

* lint fix

* Refactor `example_dequant_groupedgemm_bf16_mxfp4_hopper.py` to use shared memory for `sorted_token_ids` instead of local memory, improving thread synchronization. Adjust default argument values for M, N, and K in the main function for better testing scenarios.

* Add UniformExprChecker to enforce thread synchronization rules

Introduce the UniformExprChecker class to determine if expressions are uniform across threads, crucial for safe synchronization in conditional statements. Update the TileLangThreadSyncPlanner to hoist synchronization points out of non-uniform if-statements to prevent potential deadlocks. Enhance tests to validate sync hoisting behavior for various non-uniform conditions involving thread indices and shared memory access.

* lint fix

* Enhance `example_dequant_groupedgemm_bf16_mxfp4_hopper.py` with cache disabling and kernel source printing for debugging. Update thread synchronization logic in `thread_storage_sync.cc` to check for runtime-dependent conditions, preventing potential deadlocks by hoisting sync points as necessary.

* Update submodule `tvm` to latest commit and remove deprecated `example_gqa_decode_varlen_logits_paged.py` file. Refactor `example_gqa_decode_varlen_logits.py` to enhance performance and maintainability by removing unused imports and optimizing shared memory usage. Adjust test cases to reflect the removal of the paged example.

* fix

* Enhance thread synchronization logic in `thread_storage_sync.cc` by adding a configurable warp size parameter to `RuntimeDependentConditionChecker` and `TileLangThreadSyncPlanner`. This allows for better adaptability to different target architectures. Update the logic to ensure thread extent is a constant and improve handling of runtime-dependent conditions.

* lint fix

* Refactor thread extent validation in `thread_storage_sync.cc` to use pointer checks instead of optional values. This change improves clarity and ensures that the thread extent is correctly validated as a constant.

* Adjust loop variable constraints in `thread_storage_sync.cc` for loop-carry analysis by modifying the extent calculation. This change ensures valid iteration comparisons by reducing the extent by one, allowing for accurate analysis of loop iterations.

* lint fix
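
A hedged illustration of the hazard UniformExprChecker guards against: a barrier inside a branch that only some threads take would deadlock, so the planner hoists the sync above the non-uniform if.

```python
import tilelang.language as T

@T.prim_func
def reverse_ids(ids: T.Tensor((128,), "int32"), out: T.Tensor((128,), "int32")):
    with T.Kernel(1, threads=128):
        tx = T.get_thread_binding()
        smem = T.alloc_shared((128,), "int32")
        smem[tx] = ids[tx]
        if tx < 64:                   # non-uniform: depends on the thread index
            out[tx] = smem[127 - tx]  # cross-thread read needs a sync, but a
                                      # sync *inside* this branch would deadlock,
                                      # so it must be hoisted above the if
```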
…-ai#1789)

* [Fix] cython 3.0 generates incorrect code for python stable api

* Fix for 3.9: `A | B` is invalid as expression even with `__future__`
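
The 3.9 pitfall from the second commit, in isolation: PEP 604 unions are fine as annotations under `from __future__ import annotations`, but not as runtime expressions.

```python
from __future__ import annotations
from typing import Optional

def f(x: int | None) -> int | None:   # ok on 3.9: annotations are never evaluated
    return x

RUNTIME_TYPE = Optional[int]           # ok on 3.9
# RUNTIME_TYPE = int | None            # TypeError on 3.9: evaluated eagerly
```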
…riable dimensions correctly (tile-ai#1794)

* [BugFix] Update buffer access in TensorCoreIntrinEmitter to handle variable dimensions correctly

* lint fix
…-ai#1795)

* Fix thread storage synchronization logic in `thread_storage_sync.cc` to correctly identify conflicts between read and write operations based on loop carry conditions.

* lint fix

* Refactor `example_dequant_groupedgemm_bf16_mxfp4_hopper.py` to use shared memory for `sorted_token_ids` instead of local memory, improving thread synchronization. Adjust default argument values for M, N, and K in the main function for better testing scenarios.

* Add UniformExprChecker to enforce thread synchronization rules

Introduce the UniformExprChecker class to determine if expressions are uniform across threads, crucial for safe synchronization in conditional statements. Update the TileLangThreadSyncPlanner to hoist synchronization points out of non-uniform if-statements to prevent potential deadlocks. Enhance tests to validate sync hoisting behavior for various non-uniform conditions involving thread indices and shared memory access.

* lint fix

* Enhance `example_dequant_groupedgemm_bf16_mxfp4_hopper.py` with cache disabling and kernel source printing for debugging. Update thread synchronization logic in `thread_storage_sync.cc` to check for runtime-dependent conditions, preventing potential deadlocks by hoisting sync points as necessary.

* Update submodule `tvm` to latest commit and remove deprecated `example_gqa_decode_varlen_logits_paged.py` file. Refactor `example_gqa_decode_varlen_logits.py` to enhance performance and maintainability by removing unused imports and optimizing shared memory usage. Adjust test cases to reflect the removal of the paged example.

* fix

* Enhance thread synchronization logic in `thread_storage_sync.cc` by adding a configurable warp size parameter to `RuntimeDependentConditionChecker` and `TileLangThreadSyncPlanner`. This allows for better adaptability to different target architectures. Update the logic to ensure thread extent is a constant and improve handling of runtime-dependent conditions.

* lint fix

* Refactor thread extent validation in `thread_storage_sync.cc` to use pointer checks instead of optional values. This change improves clarity and ensures that the thread extent is correctly validated as a constant.

* Adjust loop variable constraints in `thread_storage_sync.cc` for loop-carry analysis by modifying the extent calculation. This change ensures valid iteration comparisons by reducing the extent by one, allowing for accurate analysis of loop iterations.

* lint fix

* Refactor thread variable handling in `thread_storage_sync.cc` to improve conflict detection logic. Introduced shared variable usage for WAW/RAR access types and distinct variables for RAW/WAR types, enhancing the accuracy of cross-thread dependency checks. Updated thread condition logic accordingly.

* lint fix
* Add tilelang semantics guide to programming guides section in documentation

* refactor docs

---------

Co-authored-by: SiriusNEO <chaofan@deepseek.com>
…tContains to layout utils (tile-ai#1779)

* [Feature] Implement ProveFragmentContains Function for Fragment Thread Validation

- Added the ProveFragmentContains function to check if the threads accessing elements of a smaller fragment are a subset of those accessing a larger fragment.
- This function ensures valid access when transitioning from a smaller to a larger fragment layout.
- Updated layout.cc and utils.cc to incorporate this new functionality, enhancing the layout validation process.
- Removed the previous implementation of ProveFragmentContains from parallel.cc to streamline the codebase.

* fix

* Refactor ParallelOpNode Layout Handling

- Removed the initial DeReplicate attempt from InferLayout to streamline layout inference.
- Added DeReplicate logic to ComputeLoopLayoutFromBuffer to reduce replication when validating layout candidates.
- Updated test cases to disable caching and ensure proper functionality of loop layout kernels.

* fix

* Refactor Test Cases for Loop Layout

- Removed caching disablement and print statements from the loop layout identity test for cleaner output.
- Updated the main execution block to directly call the testing framework, enhancing test execution flow.
…on (tile-ai#1796)

* [Feature] Support passing PrimExpr value in tile-level atomic operation

* fix after rebase

* address comments

* fix tvm ver

* fix
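
A sketch of the new capability: the value operand of a tile-level atomic can now be a PrimExpr rather than only a buffer element (call shape otherwise as in existing `T.atomic_add` usage).

```python
import tilelang.language as T

@T.prim_func
def scatter_add(A: T.Tensor((128,), "float32"), out: T.Tensor((1,), "float32")):
    with T.Kernel(1, threads=128):
        tx = T.get_thread_binding()
        T.atomic_add(out[0], A[tx] * 2.0 + 1.0)  # value is an expression, not a buffer
```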
…elined (tile-ai#1799)

* [BugFix] Fix loop-dependent conditions in IfThenElse within T.Pipelined

This commit applies the same strategy used for LetStmt to IfThenElse conditions:

1. Introduced IfWrapper struct to track if-conditions that depend on the loop variable
2. Added dependency detection that checks whether an if-condition uses:
   - The pipeline loop variable directly, OR
   - Any variable transitively dependent on the loop variable
3. Loop-dependent conditions are pushed inside each pipeline stage with the loop variable properly substituted for that iteration

* Add test for loop-dependent conditions within T.Pipelined

* Fix code format
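
A hedged sketch of the shape that used to break: an if-condition inside `T.Pipelined` that depends on the loop variable must be re-materialized per pipeline stage with the loop variable substituted for that iteration.

```python
import tilelang.language as T

@T.prim_func
def staged(A: T.Tensor((8, 128), "float32"), out: T.Tensor((8, 128), "float32"),
           valid: T.int32):
    with T.Kernel(1, threads=128):
        tx = T.get_thread_binding()
        for k in T.Pipelined(8, num_stages=2):
            if k < valid:              # condition depends on the pipeline loop var
                out[k, tx] = A[k, tx]
```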
…e-ai#1801)

* [Fix] Update loop unswitching logic to handle multiple let bindings and add corresponding test case

* remove debug print
[Docs][Puzzle] Add TileLang puzzles in README
…1811)

* Enhance plot_layout function to support both Fragment and Layout types for visualization. Update parameters for colormap and formats, and introduce helper functions for format parsing and saving plots. Improve documentation for clarity on usage and expected input types.

* lint fix

* Refactor swizzle layout functions to use dedicated layout creators. Replace inline 2D swizzle functions with calls to `make_full_bank_swizzled_layout`, `make_half_bank_swizzled_layout`, and `make_quarter_bank_swizzled_layout` for improved clarity and maintainability in layout generation.

* Remove outdated documentation from layout_swizzle.py and ensure plots are closed after saving in plot_layout.py for better resource management.
* profiler support cudagraph backend && AutoTuner support specified profiler backend

* [Enhancement] Add CUDA graph replay options to autotuning and profiling

* Introduced `cudagraph_n_replays` and `cudagraph_flush_per_iter` parameters across various functions to enhance CUDA graph profiling capabilities.
* Updated `get_best_config`, `main`, and `do_bench` functions to support new parameters for improved benchmarking accuracy.
* Enhanced `ProfileArgs` and `AutoTuner` classes to include new profiling options for better performance tuning.
* Updated documentation to reflect changes in parameter usage and functionality.
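
How the new backend might be invoked, as a sketch: the parameter names come from the commit text, while the rest of the call shape is an assumption.

```python
def bench_with_cudagraph(kernel):
    # `kernel` is a compiled tilelang JITKernel
    profiler = kernel.get_profiler()
    return profiler.do_bench(
        backend="cudagraph",            # new profiler backend
        cudagraph_n_replays=16,         # replays per measurement
        cudagraph_flush_per_iter=True,  # flush caches between iterations
    )
```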

* revert changes

---------

Co-authored-by: linjunxian <linjunxian@ai123.ink>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
* handle stale autotune state with tvm-ffi adapter

* fix the pre-commit linter issue
…le-ai#1816)

* [BugFix] LoopUnswitching: gate non-trivial else behind PassConfig

* lint fix
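
A sketch of opting in via pass configs; the key name below is hypothetical, chosen only to illustrate gating the non-trivial else branch behind a PassConfig.

```python
import tilelang

def compile_with_unswitch_else(func):
    # key name is hypothetical, not the actual PassConfig introduced here
    return tilelang.compile(
        func,
        pass_configs={"tl.LoopUnswitching.enable_nontrivial_else": True},
    )
```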
Update dependencies to resolve several issues
@kurisu6912 closed this Feb 11, 2026

Development

Successfully merging this pull request may close these issues:

- [BUG] CuTe-DSL backend wrongly converts tanh to tanhf(op) as opposed to tanh(op, fastmath=True)
- [BUG] Weird TVM internal Error