[Triton-MLIR][Backend] Add the missing support when MMAv1 as the parent of sliceEncodingAttr #983
Merged
ptillet merged 1 commit into triton-lang:triton-mlir on Dec 14, 2022
ptillet approved these changes on Dec 14, 2022
ZzEeKkAa pushed a commit to ZzEeKkAa/triton that referenced this pull request on Aug 5, 2024
The test cases mentioned in triton-lang#983 have been added to the A770 skip list. Fixes triton-lang#1579.
scxiao pushed a commit to scxiao/triton that referenced this pull request on Apr 2, 2026
Summary: This change improves the TLX emission pass to produce cleaner, more readable output by:

1. **Inlining constants at use sites** - Instead of emitting `c32_i32 = 32` and referencing `c32_i32`, constants are now inlined directly as `32` where used. This significantly reduces noise from constant definitions.
2. **Skipping non-meaningful operations** - Operations that don't contribute to TLX understanding are now filtered out:
   - `gpu.barrier` - not needed in TLX
   - `ttg.convert_layout` - internal layout conversion
   - `tt.return` / `tt.reduce.return` - terminators
   - Various warp specialization internals already skipped
3. **Skipping empty async_task blocks** - Partition regions that only contain skipped operations (like a single `tt.return`) are now omitted, eliminating empty `with tlx.async_task():` blocks.
4. **Refactored skip logic** - Replaced individual `if` statements with a `llvm::StringSet<>` lookup for cleaner, more maintainable code.

Pull Request resolved: facebookexperimental/triton#983

Test Plan:

1. Generated fwd.txt output using:
   ```
   TRITON_TLX_OUTPUT_FILE=~/tritonbench/output.py TRITON_TLX_COMPILABLE=1 TRITON_DUMP_TTGIR_TO_TLX=1 TRITON_ALWAYS_COMPILE=1 TRITON_KERNEL_DUMP=1 TRITON_DUMP_DIR=/tmp/triton_tissue030 TRITON_USE_META_WS=1 TRITON_PRINT_AUTOTUNING=1 CUDA_VISIBLE_DEVICES=3 bash ~/fbsource/fbcode/ads_mkl/benchmarks/denoise.sh python run.py --op blackwell_attentions --seq-len 8192 --batch 4 --n-heads 32 --d-head 128 --rep 3000 --sleep 1.0 --metrics tflops --simple-output --only triton_tutorial_flash_persistent_blackwell --force
   ```
2. Verified:
   - Constants are inlined (e.g., `mul(arg2, 32)` instead of `mul(arg2, c32_i32)`)
   - No empty `with tlx.async_task():` blocks at end of output
   - `gpu.barrier`, `ttg.convert_layout`, `tt.return` are not emitted

Before the change [P2205734619](https://www.internalfb.com/phabricator/paste/view/P2205734619)
After the change [P2208628663](https://www.internalfb.com/phabricator/paste/view/P2208628663)

Reviewed By: jma2333

Differential Revision: D94436902

Pulled By: tissue3

fbshipit-source-id: ddaf3e9d939b25573b2d3cac400bccae3516df44
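The three output changes described in that commit message (constant inlining, set-based op skipping, and dropping empty `async_task` blocks) can be illustrated with a small Python sketch. This is not the pass itself, which the message says uses a C++ `llvm::StringSet<>`. Here a `frozenset` stands in for that lookup, and the tuple-based op representation, the `arith.constant` and `mul` op names, and the helper names `emit_ops` and `emit_async_task` are all hypothetical, chosen only for illustration.

```python
# Hypothetical stand-in for the llvm::StringSet<> skip list; the op names
# are the ones listed in the commit message.
SKIPPED_OPS = frozenset({
    "gpu.barrier",
    "ttg.convert_layout",
    "tt.return",
    "tt.reduce.return",
})


def emit_ops(ops):
    """Emit TLX-like lines from (result, op_name, operands) tuples.

    The tuple format is an assumption made for this sketch, not the
    real IR data structure.
    """
    constants = {}  # constant result name -> literal text to inline
    lines = []
    for result, name, operands in ops:
        if name in SKIPPED_OPS:
            # One set lookup replaces a chain of individual `if` checks.
            continue
        if name == "arith.constant":
            # Record the literal instead of emitting `c32_i32 = 32`.
            constants[result] = str(operands[0])
            continue
        # Substitute recorded literals at each use site.
        args = ", ".join(constants.get(o, o) for o in operands)
        lines.append(f"{result} = {name}({args})")
    return lines


def emit_async_task(ops):
    """Emit a `with tlx.async_task():` block, or nothing if its body is empty."""
    body = emit_ops(ops)
    if not body:
        return []
    return ["with tlx.async_task():"] + ["    " + line for line in body]


example = [
    ("c32_i32", "arith.constant", [32]),
    ("v0", "mul", ["arg2", "c32_i32"]),
    (None, "gpu.barrier", []),
    (None, "tt.return", []),
]
print("\n".join(emit_async_task(example)))
```

A region containing only a terminator, such as `[(None, "tt.return", [])]`, yields an empty list from `emit_async_task`, which corresponds to the omitted empty blocks mentioned in item 3.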