
Conversation

@kurisu6912 (Owner)

senlyu163 and others added 30 commits December 17, 2025 11:39
…ile-ai#1445)

* Remove JIT decorator from elementwise_add function in examples

* fix kernel compilation without autotune

* Refactor main function to accept parameters and update tests for autotune option

* Refactor autotune test function for modern style
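
For context on these autotune changes, a minimal sketch of the pattern the tests converge on. The kernel follows tilelang's usual elementwise style; the exact test code differs, and the compile call shape is an assumption based on the commit titles.

```python
import tilelang
import tilelang.language as T

def elementwise_add(M, block_M=128, dtype="float32"):
    @T.prim_func
    def kernel(A: T.Tensor((M,), dtype), B: T.Tensor((M,), dtype),
               C: T.Tensor((M,), dtype)):
        with T.Kernel(T.ceildiv(M, block_M), threads=block_M) as bx:
            for i in T.Parallel(block_M):
                idx = bx * block_M + i
                if idx < M:  # boundary guard for M not divisible by block_M
                    C[idx] = A[idx] + B[idx]
    return kernel

# Without autotune, one fixed configuration is compiled directly:
jit_kernel = tilelang.compile(elementwise_add(1024), out_idx=[-1])
```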
* [Enhancement] Introduce non-restrict parameter support in code generation

- Added a new PrimFunc-level attribute `tl.non_restrict_params` to specify handle Vars that should not be marked with the restrict qualifier during code generation.
- Updated `CodeGenTileLangCPP`, `CodeGenTileLangCUDA`, and `CodeGenTileLangHIP` to handle non-restrict parameters, ensuring proper treatment of overlapping buffer aliases.
- Implemented a new annotation function `annotate_restrict_buffers` to facilitate the marking of buffer parameters as non-restrict.
- Enhanced the `SplitHostDevice` transformation to propagate non-restrict parameters from host to device functions.
- Added a new transform function `HoistNonRestrictParams` to manage non-restrict parameters effectively.
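
As a rough illustration of the mechanism (not the actual API surface): `with_attr` is standard TVM, the attribute key is the one named above, and the real annotation path goes through `annotate_restrict_buffers`.

```python
def mark_non_restrict(func, handle_vars):
    """Attach `tl.non_restrict_params` to a tvm.tir.PrimFunc.

    `handle_vars` are the buffer-handle Vars that may alias and therefore
    must not be emitted with the __restrict__ qualifier. Sketch only.
    """
    return func.with_attr("tl.non_restrict_params", handle_vars)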

* [Enhancement] Improve HoistNonRestrictParams transformation

- Updated the HoistNonRestrictParams function to recursively collect all `tl.non_restrict_params` annotations from nested blocks, enhancing flexibility in annotation placement.
- Introduced a new NonRestrictCollector class to manage the collection and deduplication of non-restrict parameters.
- Modified the SplitHostDevice transformation to remove the non-restrict attribute from the host-side PrimFunc after propagation to device kernels.
- Adjusted the LowerAndLegalize function to directly apply the HoistNonRestrictParams transformation without exception handling, streamlining the process.

* [Refactor] Simplify non-restrict parameter handling in code generation

- Removed unnecessary normalization logic and associated data structures from `CodeGenTileLangCPP`, `CodeGenTileLangCUDA`, and `CodeGenTileLangHIP`.
- Streamlined the handling of non-restrict parameters by directly inserting them into the `non_restrict` set, improving code clarity and maintainability.
- Updated conditional checks to eliminate redundant checks against normalized names, enhancing performance and readability.

* [Dependency] Update TVM subproject to latest commit 68aa8461

- Updated the TVM subproject to the latest commit, ensuring compatibility with recent changes and improvements.
- Refactored non-restrict parameter handling in `CodeGenTileLangCPP`, `CodeGenTileLangCUDA`, and `CodeGenTileLangHIP` to enhance code clarity and maintainability.
- Adjusted the `SplitHostDevice` transformation to streamline the propagation of non-restrict parameters.

* fix
…nctionality (tile-ai#1448)

* [Enhancement] Update examples and tests for improved type handling and functionality

- Enhanced various example scripts to support new data types and improve compatibility with PyTorch.
- Updated tests across multiple modules to ensure correct functionality with the latest changes in type handling.
- Refactored code in examples to streamline operations and improve clarity, particularly in tensor operations and memory management.
- Added comprehensive tests for new features and fixed existing issues related to type conversions and buffer handling.

* [Refactor] Update accumulation data type to float32 across examples

- Changed accumulation data type from "float" to T.float32 in multiple example scripts to ensure consistency and improve numerical stability.
- This update affects various modules including flash attention, GEMM analysis, convolution, and deepseek MLA examples, enhancing type handling across the board.
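
The pattern being standardized, sketched on a quickstart-style tilelang GEMM (block sizes illustrative): inputs stay in `T.float16` while the fragment accumulates in `T.float32`.

```python
import tilelang.language as T

def matmul(M, N, K, block_M=64, block_N=64, block_K=32,
           dtype=T.float16, accum_dtype=T.float32):  # was accum_dtype="float"
    @T.prim_func
    def main(A: T.Tensor((M, K), dtype), B: T.Tensor((K, N), dtype),
             C: T.Tensor((M, N), dtype)):
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M),
                      threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_K), dtype)
            B_shared = T.alloc_shared((block_K, block_N), dtype)
            C_local = T.alloc_fragment((block_M, block_N), accum_dtype)
            T.clear(C_local)
            for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=2):
                T.copy(A[by * block_M, k * block_K], A_shared)
                T.copy(B[k * block_K, bx * block_N], B_shared)
                T.gemm(A_shared, B_shared, C_local)  # accumulates in float32
            T.copy(C_local, C[by * block_M, bx * block_N])
    return main
```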

* [Refactor] Standardize data type usage across benchmark scripts

- Updated data type definitions in benchmark scripts to use T.float16 and T.float32 consistently, enhancing clarity and type handling.
- Adjusted dtype assignments in matmul functions and configuration setups to align with the new standard.
- Improved overall code consistency and maintainability by ensuring uniform data type usage across various modules.

* [Refactor] Standardize data type usage in templates and scripts

- Updated data type definitions in various templates and scripts to use string representations (e.g., "float16", "int32") instead of T.float16 and T.int32 for improved consistency and clarity.
- Enhanced overall code maintainability by ensuring uniform data type usage across multiple modules, including convolution, elementwise operations, and matrix multiplication templates.
- This change aims to streamline type handling and improve compatibility with existing workflows.

* [Refactor] Standardize data type usage in examples and benchmarks

- Updated data type definitions in various example and benchmark scripts to use T.float16 and T.int32 consistently, enhancing clarity and maintainability.
- Adjusted dtype assignments in kernel functions and configuration setups to align with the new standard.
- Improved overall code consistency by ensuring uniform data type usage across multiple modules, including attention mechanisms, matrix multiplication, and GEMM examples.

* [Refactor] Import dtypes from language.v2 module

- Added import statement for dtypes from the language.v2 module to enhance type handling and maintain consistency across the codebase.
- This change aims to streamline data type management and improve overall code clarity.

* fix

* [Refactor] Standardize data type usage across scripts

- Updated data type definitions in various scripts to use string representations (e.g., "float16", "int8") instead of T.float16 and T.int8 for improved consistency and clarity.
- Adjusted dtype assignments in functions and configuration setups to align with the new standard, enhancing overall code maintainability.
- This change affects multiple modules, including benchmark and attention mechanisms, ensuring uniform data type usage throughout the codebase.

* [Refactor] Update data type handling for consistency and clarity

- Changed string representations of data types in the Hint class to use T.float32 and T.int32 for improved consistency.
- Added new data types "int4" and "int16" to the dtypes module, enhancing type support across the codebase.
- Updated function signatures and assertions in the lop3 and mxfp modules to utilize the new data types, ensuring uniformity in type handling.
- This refactor aims to streamline data type management and improve overall code clarity and maintainability.

* [Enhancement] Improve data type handling and error messaging

- Introduced a mapping for canonical data types to their display strings, enhancing clarity in type representation.
- Updated the dtype creation logic to utilize the new mapping, ensuring more intuitive handling of string inputs.
- Refined error messages in the lop3 module to provide clearer feedback on invalid source formats, improving debugging and user experience.
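
A plain-Python sketch of the mapping idea (entries illustrative, not the exact table from the commit): canonical dtype names keep their precise spelling, while user-facing strings accept friendly aliases.

```python
_CANONICAL = {"float32", "float16", "int32", "int16", "int4", "float8_e4m3fn"}
_ALIASES = {"float": "float32", "half": "float16", "int": "int32"}

def canonicalize(name: str) -> str:
    # resolve display aliases first, then validate against the canonical set
    name = _ALIASES.get(name, name)
    if name not in _CANONICAL:
        raise ValueError(f"unknown dtype string: {name!r}")
    return name
```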

* [Fix] Correct boolean flag in GEMM SP test case

- Updated the boolean flag in the test_gemm_sp_sm90 function to ensure proper functionality in the test case.
- This change enhances the accuracy of the test and aligns it with expected behavior for the GEMM SP implementation.

* [Refactor] Standardize data type usage across scripts

- Updated data type definitions in various scripts to use T.float16 and T.bfloat16 consistently, enhancing clarity and maintainability.
- Adjusted dtype assignments in function signatures and argument parsing to align with the new standard, ensuring uniform data type usage throughout the codebase.
- This change affects multiple modules, including benchmarks and examples, improving overall code consistency and readability.

* [Refactor] Standardize data type usage in various modules

- Updated data type assignments in multiple scripts to utilize T.float32, T.int8, and T.int32 consistently, enhancing clarity and maintainability.
- Adjusted function signatures and parameter types across benchmarks, examples, and tests to align with the new standard, ensuring uniform data type usage throughout the codebase.
- This change improves overall code consistency and readability, impacting modules related to matrix multiplication, GEMM, and tensor operations.

* [Refactor] Update argument parsing for data types in benchmarks

- Changed argument parsing for data types in benchmark_matmul_intrinsic.py and benchmark_matmul_sp.py to use string representations ("float16", "int8", "float") instead of T.float16 and T.float.
- This update enhances consistency in data type handling across benchmark scripts, improving clarity and maintainability.

* [Refactor] Update data type handling in benchmark and example scripts

- Changed data type arguments in benchmark and example scripts to use string representations ("float16") instead of T.float16 for improved consistency.
- Updated function signatures and argument parsing to align with the new standard, enhancing clarity and maintainability across the codebase.
- This change affects multiple modules related to attention mechanisms and tensor operations, ensuring uniform data type usage throughout the examples.

* [Refactor] Fix data type conversion in multiple scripts

- Corrected the usage of the data type conversion method from dtype..as_torch() to dtype.as_torch() across various benchmark and example scripts.
- This change enhances consistency in data type handling and improves code readability, impacting modules related to attention mechanisms and tensor operations.
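
The corrected call shape, for reference (one dot, not two):

```python
import torch
import tilelang.language as T

# dtype objects expose .as_torch(); the double-dot form was a typo.
assert T.float16.as_torch() is torch.float16
```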

* [Refactor] Update float8 data type usage across multiple scripts

- Changed instances of T.float8_e4m3 to T.float8_e4m3fn in various benchmark, example, and test scripts to ensure consistency in data type handling.
- This update enhances clarity and maintainability across the codebase, particularly in modules related to matrix multiplication and tensor operations.

* [Refactor] Enhance float8 data type handling in CUDA code generation

- Updated the handling of float8 data types in the CUDA code generation to include additional float8 variants, improving type conversion logic.
- Adjusted conditions to ensure proper type checks for float8 conversions, enhancing clarity and maintainability in the codebase.
- Modified layout inference to streamline float8 type checks, ensuring consistency across the implementation.
- This change impacts modules related to matrix operations and CUDA code generation, improving overall type handling and conversion accuracy.

* [Refactor] Streamline float8 data type handling in CUDA and related modules

- Enhanced float8 data type handling in CUDA code generation by refining type conversion logic and ensuring consistent type checks.
- Updated layout inference for float8 types to improve clarity and maintainability across the implementation.
- This change impacts modules related to matrix operations and CUDA code generation, improving overall type handling and conversion accuracy.

* [Refactor] Remove unnecessary cache disabling in float8 example script

- Eliminated the call to tilelang.disable_cache() in example_group_per_split_token_cast_to_fp8.py to streamline the code.
- This change enhances clarity and maintainability of the example script without affecting its functionality.

* [Refactor] Update data type usage in debug print tests

- Changed the argument for dtype in the test_debug_print_buffer function from a string representation to the corresponding T.bool type.
- This update enhances consistency in data type handling within the test suite, improving clarity and maintainability.

* lint fix

* Update function parameter types from `str` to `T.dtype` for improved type safety in attention sink and related examples

* Refactor `gemv_alloc_reducer` function signature for improved readability by formatting parameters across multiple lines.
* fix floordiv & floormod in z3 prover

* fix lint error
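
Background on why floordiv/floormod need care in a Z3 encoding: SMT-LIB Int `div`/`mod` are Euclidean, which agrees with TIR floordiv/floormod only for positive divisors. A small general z3py check (not the prover's actual code):

```python
import z3

# For a positive divisor the Euclidean remainder always lands in [0, d),
# matching TIR floormod; the solver finds no counterexample.
a = z3.Int("a")
s = z3.Solver()
s.add(z3.Not(z3.And(a % 4 >= 0, a % 4 < 4)))
print(s.check())  # unsat: the bound holds for divisor 4 > 0
```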
* Enhance cache directory structure by including version information in sparse.py to ensure separate caches for different versions.

* Fix formatting in sparse.py by adding a newline for improved readability and consistency.
…ernel (tile-ai#1461)

* add curand.{curand_init, curand}

* run format.sh

* add default value for curand_init & add test for curand

* Update testing/python/language/test_rand.py

Remove unused thread binding

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* remove unused library

* enable tilelang cache for testing

* run format.sh

* Revert "run format.sh"

This reverts commit 5afaff7.

* Revert "enable tilelang cache for testing"

This reverts commit c277a43.

* Revert "remove unused library"

This reverts commit 568ad20.

* run format.sh

* ensure FreshName for __philox_state

* ensure FreshName for __philox_state

* change the return type of T.rng_init
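
A sketch of the new RNG intrinsics from this PR; `T.rng_init` and `T.rng_rand` are the names used in these commits, but the signatures below are assumptions.

```python
import tilelang.language as T

@T.prim_func
def fill_random(out: T.Tensor((1024,), "float32")):
    with T.Kernel(8, threads=128) as bx:
        tx = T.get_thread_binding()
        state = T.rng_init(42)   # per-thread Philox state; return type changed above
        out[bx * 128 + tx] = T.rng_rand(state)  # draw one uniform sample
```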

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>
* feat: CuTeDSL backend

* fix: clang-tidy

* fix: clang-format

* fix: ci

* fix: revert example gemm fp8

* fix: remove duplicate code

* fix: switch-case

* fix: fp16 silence

* fix: TVM IR print

* fix: useless tir

* fix: clang-format

* fix: remove tilelang/contrib/cutedsl/.gitignore

* fix: use hexfloat

* fix: gsym guard

* fix: unknown storage sync type

* fix: string literal

* fix: add args guard

* fix: name hint dedup

* fix: better find_kernel_by_pattern

* fix: set libpath for from_database path

* fix: guard buffer.strides

* fix: from guard

* fix: eviction guard

* fix: use thread local tma descs

* fix: ruff

* fix: drop tma_init_cpp

* fix: exc_info

* fix: negative unmatch early return

* fix: rename postproc func and add test

* fix: handle fast math according to pass config

* fix: dyn_sym parse

* fix: wrap_forward

* fix: use tvm_ffi.libinfo instead of cli

* fix: keep signature

* fix: C++ string safety

* fix: mark tma_store_add as unsupported

* fix: tvm version

* resolve ldsm and cpasync issues.

* fix: minor fixes

* fix: parse signature using ast

* fix: guard global_addr

* fix: create tempfile only when necessary

* fix: use logger.exception for exceptions

* fix: guard lib_path and host_func

* fix: remove tma_cpp_init and add timeout for cpp compile

* add timeout for mbarrier_wait.

* fix: _load_kernel_from_disk signature

* resolve codegen issues.

* fix: logger.exception

* add comment for div_by=1

* merge

* fix: reserve cutlass,cute,tl

* fix: guard tma_store

* fix: allow int64 offset in make_tensor_at_offset

* fix: guard barrier

* fix: add comments for div_by=16

* fix: div_by=1 issue

* delete div_by when offset is 0

* use tl.make_tensor when offset is 0

* fix: explicitly check cutedsl target

* fix: use param.torch_dtype()
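
How the new backend would be selected, as a sketch: the target string `"cutedsl"` is inferred from "fix: explicitly check cutedsl target" above, so treat the exact spelling as an assumption.

```python
import tilelang

def compile_for_cutedsl(func):
    # func is a tilelang PrimFunc; target string assumed from the commits above
    return tilelang.compile(func, target="cutedsl")
```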

---------

Co-authored-by: yuxic <yuxic@nvidia.com>
Co-authored-by: Yong <yong@local>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
…lang_language_rand.py` (tile-ai#1464)

* rename test for curand & add triton baseline

* add a comment for calling T.rng_rand() four times

* refactor tilelang&triton kernel

* Add boundary checks for M not divisible by 128
)

* feat(arg_binder): enhance shape variable handling and assertions

- Implemented special handling for comparing if_then_else expressions to simplify conditions involving NULL checks.
- Added methods to set shared shape variables and finalize deferred bindings, generating cascading if_then_else expressions and runtime assertions for non-NULL buffers.
- Updated the binding logic to defer shape variable bindings for shared variables, ensuring proper handling across multiple nullable buffers.

* refactor(arg_binder): clean up shape variable handling and remove unused code

- Removed deprecated methods for setting shared shape variables and finalizing deferred bindings, streamlining the argument binding process.
- Simplified the logic for handling shape values in the `BindDLTensor` function, ensuring immediate binding for normal shape variables.
- Enhanced clarity by eliminating unnecessary comments and code related to cascading if_then_else expressions for shared variables.

* refactor(arg_binder): enhance DLTensor binding with improved shape handling

- Replaced the single `BindDLTensor` method with `BindDLTensors` to support multiple buffers, improving flexibility in handling DLTensor bindings.
- Introduced a two-pass approach for shape variable handling, allowing for better management of symbolic dimensions and null checks.
- Updated the logic to assert non-null conditions at runtime and utilize cascaded if_then_else expressions for shape retrieval, enhancing robustness.
- Removed deprecated code and streamlined the binding process for clarity and maintainability.

* fix(test_nullable_buffer_params): improve formatting and consistency in test output

- Updated string formatting for better readability in the `test_nullable_shared_shape` function.
- Ensured consistent use of double quotes for string literals.
- Added a missing newline at the end of the file for proper formatting.

* refactor(arg_binder): simplify allocation size calculation in BindDLTensors

- Streamlined the calculation of allocation size by replacing a lambda function with a direct loop, enhancing readability and maintainability.
- Improved clarity in the null check message for data pointers, ensuring better understanding of the binding process.
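
The two-pass idea in plain Python, stripped of TIR details (names hypothetical): take the symbolic extent from the first non-NULL buffer, then runtime-assert the others agree.

```python
def bind_shared_dim(tensors, axis):
    # pass 1: gather the axis extent from every non-NULL tensor
    candidates = [t.shape[axis] for t in tensors if t is not None]
    assert candidates, "all nullable buffers are NULL; shared dim is unbound"
    # pass 2: bind to the first candidate (the cascaded if_then_else),
    # and assert the remaining non-NULL buffers match it
    extent = candidates[0]
    assert all(c == extent for c in candidates[1:]), "shape mismatch"
    return extent
```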

* Remove debug prints from phase.py

Removed debug print statements after MakePackedAPI transformation.
…-ai#1466)

* [Language] Make TL scripts friendly to Python syntax highlighting

* add comments

* fix submodule
… into examples (tile-ai#1470)

* remove triton dependence in testing & move triton baseline into example

* use ceildiv and handles arbitrary M correctly for triton
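
The standard shape this baseline ends up with, as a sketch in plain Triton: a ceildiv grid so the tail block is launched, plus a mask so it stays in bounds.

```python
import triton
import triton.language as tl

@triton.jit
def copy_kernel(x_ptr, y_ptr, M, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < M                      # boundary check when M % BLOCK != 0
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(y_ptr + offs, x, mask=mask)

def launch(x, y, M, BLOCK=128):
    grid = (triton.cdiv(M, BLOCK),)      # ceildiv: include the partial tail block
    copy_kernel[grid](x, y, M, BLOCK=BLOCK)
```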
…e-ai#1473)

* [Language] Enhance dtype conversion for PyTorch compatibility

- Added support for new float8 and float4 data types in the __dtype_as_torch__ method.
- Implemented backend-specific handling for float8_e4m3 based on HIP or CUDA.
- Included assertions to ensure compatibility with the required PyTorch versions for each dtype.
- Improved error handling for unsupported dtypes.
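
A hypothetical helper illustrating the version-gated mapping; the real method is `__dtype_as_torch__`, with HIP/CUDA-specific handling for `float8_e4m3` that this sketch omits.

```python
import torch

def as_torch_fp8(name: str) -> torch.dtype:
    # guard against older PyTorch builds that lack the fp8/fp4 attributes
    assert hasattr(torch, name), f"{name} requires a newer PyTorch version"
    return getattr(torch, name)

print(as_torch_fp8("float8_e4m3fn"))  # torch.float8_e4m3fn on PyTorch >= 2.1
```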

* Fix test script execution and improve error messages for dtype assertions

- Commented out the main execution call in the test script and replaced it with a direct call to the test function `test_divmod()`.
- Enhanced error messages in the dtype conversion assertions to improve clarity and readability, ensuring proper guidance for required PyTorch versions.
* Update README.md with latest news, including CuTeDSL backend support, Z3 theorem prover integration, and migration to apache-tvm-ffi for improved compatibility.

* Update README.md to enhance CuTeDSL backend announcement with a link to related issue and clarify migration benefits to apache-tvm-ffi, reducing CPU overhead.
* use static Z3 context

* Update submodule reference for TVM to indicate a dirty state
…rp specialized pass (tile-ai#1484)

* [Feature] Add FullyReplicated Fragment Layout and Enhance Layout Inference

* Introduced a new static method `FullyReplicated` in the `Fragment` class to create fully replicated fragment layouts, ensuring all threads hold identical copies of the buffer.
* Updated `CopyNode` to collect fragment layouts and mark them as fully replicated during layout inference.
* Enhanced `ParallelOpNode` to expand let bindings for fragment buffer accesses, improving layout inference accuracy.
* Added documentation for new methods and updated existing methods to support the new layout features.

* lint fix

* Remove debug logging statements from layout inference process to streamline output and improve performance.
…icAlignment` as they are legacy (tile-ai#1486)

* [Cleanup] Remove dynamic shape example and related tests

* Deleted the dynamic shape example script `example_dynamic.py` and its corresponding test file `test_example_dynamic.py` to streamline the codebase.
* Removed unused dynamic tail split and dynamic alignment configurations from `builtin.h` and `pass_config.py`.
* Cleaned up the dynamic shape testing files to eliminate redundancy and improve maintainability.

* build fix
…Evaluator (tile-ai#1491)

* [Cleanup] Remove dynamic shape example and related tests

* Deleted the dynamic shape example script `example_dynamic.py` and its corresponding test file `test_example_dynamic.py` to streamline the codebase.
* Removed unused dynamic tail split and dynamic alignment configurations from `builtin.h` and `pass_config.py`.
* Cleaned up the dynamic shape testing files to eliminate redundancy and improve maintainability.

* build fix

* Update submodule reference for TVM to latest commit 315036dc

* phase out z3
* [Feature]: Add benchmark scripts for examples

* apply cupti

* fix

* format

* initial commit

* fix

* upd

* upd

* lint

* fix

* fake

* Simplify PR regression test workflow

Removed redundant 'Clean pip environment' steps from the workflow.

* Update test_perf_regression.py

* Enhance regression test bot workflow file handling

Updated the GitHub Actions workflow to improve file handling for the regression test report.

* Update regression test workflow for artifact naming

* Update pr-regression-test-bot.yml

* fix

* lint

* Update performance regression test trigger conditions

---------

Co-authored-by: yyttt6 <1652272478@qq.com>
Updated concurrency group to use issue/PR number.
…troduce processing for floating fragment buffers (tile-ai#1495)

* [Refactor] Replace local allocations with variable allocations in various examples and operations

* Updated multiple files to replace local buffer allocations with variable allocations for improved performance and clarity.
* Changed `alloc_local` to `alloc_var` in examples related to attention mechanisms, deep learning models, and GEMM operations.
* Enhanced code readability and maintainability by streamlining buffer management across different components.
* Ensured consistent handling of buffer scopes and types throughout the codebase.

* typo fix

* test fix

* [Refactor] Simplify index handling in sparse MLA forward pipelined example

* Updated index handling in `sparse_mla_fwd_pipelined.py` to eliminate unnecessary local array usage, improving code clarity and performance.
* Replaced instances of `indices_local[0]` with direct usage of `indices_local` for better readability and consistency in buffer access.
* Commented out the main execution call in the GDN test script to focus on the specific test function, enhancing test clarity.

* lint fix
)

* [Enhancement] Optimize MHA varlen fwd and support autotune

* use fa2 instead of fa3 as baseline in ci
…upported FP8 type (tile-ai#1474)

* Refactor CUDA vectorized cast generation and remove unsupported FP8 type

* test fix

* lint fix

* Refactor CUDA vectorized cast function naming for clarity

* Add support for float4_e2m1fn type conversions in CUDA vectorized casts

- Implemented conversions between float4_e2m1fn and float32, half2, and float2 in utils.cc and cuda_fp4.h.
- Updated test_tilelang_language_vectorized_cast.py to validate new conversions and ensure correctness.
- Enhanced dtype conversion in dtypes.py to handle float4_e2m1fn appropriately, logging a warning for unsupported types in PyTorch.

* Enhance vectorized cast tests for new data types

- Added tests for vectorized casting of float8 and float4 data types, ensuring compatibility with CUDA compute versions.
- Refactored existing test functions to improve clarity and organization, separating tests for different data types.
- Updated parameterization to include additional test cases for new conversions.

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
SiriusNEO and others added 29 commits February 2, 2026 18:47
* [Refactor] Unify the usage of cast-related operators

* reinterpret auto detect
…IT compilations (tile-ai#1776)

Refactor pass_configs initialization in JITKernel to ensure a new dictionary is created if pass_configs is not None. This change improves clarity and prevents potential issues with mutable default arguments.
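
The pattern described above, in isolation: copy rather than alias, and never mutate a shared default.

```python
def normalize_pass_configs(pass_configs=None):
    # fresh dict per call; avoids mutating caller state or a mutable default
    return dict(pass_configs) if pass_configs is not None else {}
```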
* [CI] [pre-commit.ci] autoupdate

updates:
- [github.com/astral-sh/ruff-pre-commit: v0.14.11 → v0.14.14](astral-sh/ruff-pre-commit@v0.14.11...v0.14.14)
- [github.com/jackdewinter/pymarkdown: v0.9.34 → v0.9.35](jackdewinter/pymarkdown@v0.9.34...v0.9.35)

* sync requirements-lint.txt

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: SiriusNEO <chaofan@deepseek.com>
…ontend (tile-ai#1777)

* temp

* temp

* [Refactor] Improve type annotations and reduce some lint errors

* some fixes

* update

* update

* address comments

* address comments

* fix print

* address comments

* refactor typing to _typing

* fix more

* fix reduce

* no return

* fix

* fix cumsum
Update TVM submodule: fix select/if_then_else OOB access

Update TVM to include fix for out-of-bounds memory access when
if_then_else is nested inside select during code generation.

See: tile-ai/tvm#26

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
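
A hedged reconstruction of the shape the TVM fix targets (names and sizes illustrative): `Select` may evaluate both arms eagerly, so the bounds-guarded load must stay inside the short-circuiting `if_then_else`.

```python
import tilelang.language as T

@T.prim_func
def gather(A: T.Tensor((128,), "float32"), B: T.Tensor((128,), "float32"),
           C: T.Tensor((128,), "float32"), n: T.int32):
    for i in T.serial(128):
        # if_then_else nested inside Select: the pattern that previously
        # produced an out-of-bounds access during codegen
        C[i] = T.Select(i % 2 == 0,
                        T.if_then_else(i < n, A[i], T.float32(0)),
                        B[i])
```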
…ile-ai#1772)

* [Feature] Add fully replicated layout interface in annotation layout

* Lint

* Remove test for issue 1729 from the tilelang testing suite

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
…or (tile-ai#1784)

[Example][BugFix] Fix argument override in deepseek_v32 topk_selector example.
…ile-ai#1778)

* Fix type annotations for T.reshape and T.view

* Fix issue tile-ai#1666: reduce_sum with clear=False not accumulating correctly

* address comments and add testcases

* add more tests
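
The semantics fixed by tile-ai#1666, sketched below; the `dim`/`clear` keywords follow the issue text, the rest of the call shape is assumed.

```python
import tilelang.language as T

@T.prim_func
def rowsum(A: T.Tensor((64, 64), "float32"), B: T.Tensor((64, 64), "float32"),
           out: T.Tensor((64,), "float32")):
    with T.Kernel(1, threads=128):
        frag = T.alloc_fragment((64, 64), "float32")
        acc = T.alloc_fragment((64,), "float32")
        T.copy(A, frag)
        T.reduce_sum(frag, acc, dim=1, clear=True)   # acc = sum(A, dim=1)
        T.copy(B, frag)
        T.reduce_sum(frag, acc, dim=1, clear=False)  # acc += sum(B, dim=1): the fixed path
        T.copy(acc, out)
```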

---------

Co-authored-by: SiriusNEO <chaofan@deepseek.com>
* fix

* simplify the constraint

* fix docs

---------

Co-authored-by: SiriusNEO <chaofan@deepseek.com>
…ite-after-read (tile-ai#1781)

* Fix thread storage synchronization logic in `thread_storage_sync.cc` to correctly identify conflicts between read and write operations based on loop carry conditions.

* lint fix

* Refactor `example_dequant_groupedgemm_bf16_mxfp4_hopper.py` to use shared memory for `sorted_token_ids` instead of local memory, improving thread synchronization. Adjust default argument values for M, N, and K in the main function for better testing scenarios.

* Add UniformExprChecker to enforce thread synchronization rules

Introduce the UniformExprChecker class to determine if expressions are uniform across threads, crucial for safe synchronization in conditional statements. Update the TileLangThreadSyncPlanner to hoist synchronization points out of non-uniform if-statements to prevent potential deadlocks. Enhance tests to validate sync hoisting behavior for various non-uniform conditions involving thread indices and shared memory access.

* lint fix

* Enhance `example_dequant_groupedgemm_bf16_mxfp4_hopper.py` with cache disabling and kernel source printing for debugging. Update thread synchronization logic in `thread_storage_sync.cc` to check for runtime-dependent conditions, preventing potential deadlocks by hoisting sync points as necessary.

* Update submodule `tvm` to latest commit and remove deprecated `example_gqa_decode_varlen_logits_paged.py` file. Refactor `example_gqa_decode_varlen_logits.py` to enhance performance and maintainability by removing unused imports and optimizing shared memory usage. Adjust test cases to reflect the removal of the paged example.

* fix

* Enhance thread synchronization logic in `thread_storage_sync.cc` by adding a configurable warp size parameter to `RuntimeDependentConditionChecker` and `TileLangThreadSyncPlanner`. This allows for better adaptability to different target architectures. Update the logic to ensure thread extent is a constant and improve handling of runtime-dependent conditions.

* lint fix

* Refactor thread extent validation in `thread_storage_sync.cc` to use pointer checks instead of optional values. This change improves clarity and ensures that the thread extent is correctly validated as a constant.

* Adjust loop variable constraints in `thread_storage_sync.cc` for loop-carry analysis by modifying the extent calculation. This change ensures valid iteration comparisons by reducing the extent by one, allowing for accurate analysis of loop iterations.

* lint fix
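
A hedged illustration of the hazard UniformExprChecker guards against: a barrier inside a branch that only some threads take would deadlock, so the planner hoists the sync above the non-uniform if.

```python
import tilelang.language as T

@T.prim_func
def reverse_ids(ids: T.Tensor((128,), "int32"), out: T.Tensor((128,), "int32")):
    with T.Kernel(1, threads=128):
        tx = T.get_thread_binding()
        smem = T.alloc_shared((128,), "int32")
        smem[tx] = ids[tx]
        if tx < 64:                   # non-uniform: depends on the thread index
            out[tx] = smem[127 - tx]  # cross-thread read needs a sync, but a
                                      # sync *inside* this branch would deadlock,
                                      # so it must be hoisted above the if
```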
…-ai#1789)

* [Fix] cython 3.0 generates incorrect code for python stable api

* Fix for 3.9: `A | B` is invalid as expression even with `__future__`
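
The 3.9 pitfall from the second commit, in isolation: PEP 604 unions are fine as annotations under `from __future__ import annotations`, but not as runtime expressions.

```python
from __future__ import annotations
from typing import Optional

def f(x: int | None) -> int | None:   # ok on 3.9: annotations are never evaluated
    return x

RUNTIME_TYPE = Optional[int]           # ok on 3.9
# RUNTIME_TYPE = int | None            # TypeError on 3.9: evaluated eagerly
```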
…riable dimensions correctly (tile-ai#1794)

* [BugFix] Update buffer access in TensorCoreIntrinEmitter to handle variable dimensions correctly

* lint fix
…-ai#1795)

* Fix thread storage synchronization logic in `thread_storage_sync.cc` to correctly identify conflicts between read and write operations based on loop carry conditions.

* lint fix

* Refactor `example_dequant_groupedgemm_bf16_mxfp4_hopper.py` to use shared memory for `sorted_token_ids` instead of local memory, improving thread synchronization. Adjust default argument values for M, N, and K in the main function for better testing scenarios.

* Add UniformExprChecker to enforce thread synchronization rules

Introduce the UniformExprChecker class to determine if expressions are uniform across threads, crucial for safe synchronization in conditional statements. Update the TileLangThreadSyncPlanner to hoist synchronization points out of non-uniform if-statements to prevent potential deadlocks. Enhance tests to validate sync hoisting behavior for various non-uniform conditions involving thread indices and shared memory access.

* lint fix

* Enhance `example_dequant_groupedgemm_bf16_mxfp4_hopper.py` with cache disabling and kernel source printing for debugging. Update thread synchronization logic in `thread_storage_sync.cc` to check for runtime-dependent conditions, preventing potential deadlocks by hoisting sync points as necessary.

* Update submodule `tvm` to latest commit and remove deprecated `example_gqa_decode_varlen_logits_paged.py` file. Refactor `example_gqa_decode_varlen_logits.py` to enhance performance and maintainability by removing unused imports and optimizing shared memory usage. Adjust test cases to reflect the removal of the paged example.

* fix

* Enhance thread synchronization logic in `thread_storage_sync.cc` by adding a configurable warp size parameter to `RuntimeDependentConditionChecker` and `TileLangThreadSyncPlanner`. This allows for better adaptability to different target architectures. Update the logic to ensure thread extent is a constant and improve handling of runtime-dependent conditions.

* lint fix

* Refactor thread extent validation in `thread_storage_sync.cc` to use pointer checks instead of optional values. This change improves clarity and ensures that the thread extent is correctly validated as a constant.

* Adjust loop variable constraints in `thread_storage_sync.cc` for loop-carry analysis by modifying the extent calculation. This change ensures valid iteration comparisons by reducing the extent by one, allowing for accurate analysis of loop iterations.

* lint fix

* Refactor thread variable handling in `thread_storage_sync.cc` to improve conflict detection logic. Introduced shared variable usage for WAW/RAR access types and distinct variables for RAW/WAR types, enhancing the accuracy of cross-thread dependency checks. Updated thread condition logic accordingly.

* lint fix
* Add tilelang semantics guide to programming guides section in documentation

* refactor docs

---------

Co-authored-by: SiriusNEO <chaofan@deepseek.com>
…tContains to layout utils (tile-ai#1779)

* [Feature] Implement ProveFragmentContains Function for Fragment Thread Validation

- Added the ProveFragmentContains function to check if the threads accessing elements of a smaller fragment are a subset of those accessing a larger fragment.
- This function ensures valid access when transitioning from a smaller to a larger fragment layout.
- Updated layout.cc and utils.cc to incorporate this new functionality, enhancing the layout validation process.
- Removed the previous implementation of ProveFragmentContains from parallel.cc to streamline the codebase.

* fix

* Refactor ParallelOpNode Layout Handling

- Removed the initial DeReplicate attempt from InferLayout to streamline layout inference.
- Added DeReplicate logic to ComputeLoopLayoutFromBuffer to reduce replication when validating layout candidates.
- Updated test cases to disable caching and ensure proper functionality of loop layout kernels.

* fix

* Refactor Test Cases for Loop Layout

- Removed caching disablement and print statements from the loop layout identity test for cleaner output.
- Updated the main execution block to directly call the testing framework, enhancing test execution flow.
…on (tile-ai#1796)

* [Feature] Support passing PrimExpr value in tile-level atomic operation

* fix after rebase

* address comments

* fix tvm ver

* fix
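
A sketch of the new capability: the value operand of a tile-level atomic can now be a PrimExpr rather than only a buffer element (call shape otherwise as in existing `T.atomic_add` usage).

```python
import tilelang.language as T

@T.prim_func
def scatter_add(A: T.Tensor((128,), "float32"), out: T.Tensor((1,), "float32")):
    with T.Kernel(1, threads=128):
        tx = T.get_thread_binding()
        T.atomic_add(out[0], A[tx] * 2.0 + 1.0)  # value is an expression, not a buffer
```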
…elined (tile-ai#1799)

* [BugFix] Fix loop-dependent conditions in IfThenElse within T.Pipelined

This commit applies the same strategy used for LetStmt to IfThenElse conditions:

1. Introduced IfWrapper struct to track if-conditions that depend on the loop variable
2. Added dependency detection that checks whether an if-condition uses:
   - The pipeline loop variable directly, OR
   - Any variable transitively dependent on the loop variable
3. Loop-dependent conditions are pushed inside each pipeline stage with the loop variable properly substituted for that iteration

* Add test for loop-dependent conditions within T.Pipelined

* Fix code format
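
A hedged sketch of the shape that used to break: an if-condition inside `T.Pipelined` that depends on the loop variable must be re-materialized per pipeline stage with the loop variable substituted for that iteration.

```python
import tilelang.language as T

@T.prim_func
def staged(A: T.Tensor((8, 128), "float32"), out: T.Tensor((8, 128), "float32"),
           valid: T.int32):
    with T.Kernel(1, threads=128):
        tx = T.get_thread_binding()
        for k in T.Pipelined(8, num_stages=2):
            if k < valid:              # condition depends on the pipeline loop var
                out[k, tx] = A[k, tx]
```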
…e-ai#1801)

* [Fix] Update loop unswitching logic to handle multiple let bindings and add corresponding test case

* remove debug print
[Docs][Puzzle] Add TileLang puzzles in README
…1811)

* Enhance plot_layout function to support both Fragment and Layout types for visualization. Update parameters for colormap and formats, and introduce helper functions for format parsing and saving plots. Improve documentation for clarity on usage and expected input types.

* lint fix

* Refactor swizzle layout functions to use dedicated layout creators. Replace inline 2D swizzle functions with calls to `make_full_bank_swizzled_layout`, `make_half_bank_swizzled_layout`, and `make_quarter_bank_swizzled_layout` for improved clarity and maintainability in layout generation.

* Remove outdated documentation from layout_swizzle.py and ensure plots are closed after saving in plot_layout.py for better resource management.
* profiler support cudagraph backend && AutoTuner support specified profiler backend

* [Enhancement] Add CUDA graph replay options to autotuning and profiling

* Introduced `cudagraph_n_replays` and `cudagraph_flush_per_iter` parameters across various functions to enhance CUDA graph profiling capabilities.
* Updated `get_best_config`, `main`, and `do_bench` functions to support new parameters for improved benchmarking accuracy.
* Enhanced `ProfileArgs` and `AutoTuner` classes to include new profiling options for better performance tuning.
* Updated documentation to reflect changes in parameter usage and functionality.
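
How the new backend might be invoked, as a sketch: the parameter names come from the commit text, while the rest of the call shape is an assumption.

```python
def bench_with_cudagraph(kernel):
    # `kernel` is a compiled tilelang JITKernel
    profiler = kernel.get_profiler()
    return profiler.do_bench(
        backend="cudagraph",            # new profiler backend
        cudagraph_n_replays=16,         # replays per measurement
        cudagraph_flush_per_iter=True,  # flush caches between iterations
    )
```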

* revert changes

---------

Co-authored-by: linjunxian <linjunxian@ai123.ink>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
* handle stale autotune state with tvm-ffi adapter

* fix the pre-commit linter issue
…le-ai#1816)

* [BugFix] LoopUnswitching: gate non-trivial else behind PassConfig

* lint fix
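
A sketch of opting in via pass configs; the key name below is hypothetical, chosen only to illustrate gating the non-trivial else branch behind a PassConfig.

```python
import tilelang

def compile_with_unswitch_else(func):
    # key name is hypothetical, not the actual PassConfig introduced here
    return tilelang.compile(
        func,
        pass_configs={"tl.LoopUnswitching.enable_nontrivial_else": True},
    )
```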
Update dependencies to resolve several issues
@kurisu6912 closed this Feb 11, 2026

Development

Successfully merging this pull request may close these issues:

- [BUG] CuTe-DSL backend wrongly converts tanh to tanhf(op) as opposed to tanh(op, fastmath=True)
- [BUG] Weird TVM internal Error