[Triton-MLIR][Backend]Add the missing support when MMAv1 as the paren…#983

Merged
ptillet merged 1 commit into triton-lang:triton-mlir from goostavz:goostavz/dev_mma_v1 on Dec 14, 2022

@goostavz
Collaborator

…t of sliceEncodingAttr

@goostavz goostavz requested a review from ptillet as a code owner December 14, 2022 12:00
@ptillet ptillet merged commit 8025455 into triton-lang:triton-mlir Dec 14, 2022
ZzEeKkAa pushed a commit to ZzEeKkAa/triton that referenced this pull request Aug 5, 2024
The test cases mentioned in triton-lang#983 have been added to the A770 skip list.

Fixes triton-lang#1579.
scxiao pushed a commit to scxiao/triton that referenced this pull request Apr 2, 2026
Summary:
This change improves the TLX emission pass to produce cleaner, more readable output by:

1. **Inlining constants at use sites** - Instead of emitting `c32_i32 = 32` and referencing `c32_i32`, constants are now inlined directly as `32` where used. This significantly reduces noise from constant definitions.

2. **Skipping non-meaningful operations** - Operations that don't contribute to TLX understanding are now filtered out:
   - `gpu.barrier` - not needed in TLX
   - `ttg.convert_layout` - internal layout conversion
   - `tt.return` / `tt.reduce.return` - terminators
   - Various warp specialization internals already skipped

3. **Skipping empty async_task blocks** - Partition regions that only contain skipped operations (like a single `tt.return`) are now omitted, eliminating empty `with tlx.async_task():` blocks.

4. **Refactored skip logic** - Replaced individual `if` statements with a `llvm::StringSet<>` lookup for cleaner, more maintainable code.
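
The four changes above can be illustrated with a minimal Python sketch over a toy op list. This is not the actual pass, which is written in C++ and uses `llvm::StringSet<>`; the `Op` dataclass, the `emit` function, the `async_task` region op name, and the rule of using the last component of the op name as the emitted call are all hypothetical simplifications introduced here for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical skip set mirroring the ops named above. The real pass keeps
# its list in an llvm::StringSet<> and skips more ops than shown here.
SKIPPED_OPS = frozenset({
    "gpu.barrier",
    "ttg.convert_layout",
    "tt.return",
    "tt.reduce.return",
})


@dataclass
class Op:
    name: str                                     # e.g. "arith.muli"
    result: str = ""                              # result name, "" if none
    operands: list = field(default_factory=list)  # operand names
    value: object = None                          # payload of "arith.constant"
    body: list = field(default_factory=list)      # nested ops of a region op


def emit(ops, constants=None, indent=0):
    """Return the emitted lines for a list of toy ops."""
    constants = {} if constants is None else constants
    pad = "    " * indent
    lines = []
    for op in ops:
        if op.name in SKIPPED_OPS:
            continue  # single set lookup instead of a chain of `if`s
        if op.name == "arith.constant":
            # Record the literal rather than emitting `c32_i32 = 32`.
            constants[op.result] = repr(op.value)
            continue
        if op.name == "async_task":
            inner = emit(op.body, constants, indent + 1)
            if inner:  # omit regions whose body was entirely skipped
                lines.append(f"{pad}with tlx.async_task():")
                lines.extend(inner)
            continue
        # Substitute recorded constants at the use site.
        args = ", ".join(constants.get(a, a) for a in op.operands)
        call = f"{op.name.split('.')[-1]}({args})"
        lines.append(f"{pad}{op.result} = {call}" if op.result else f"{pad}{call}")
    return lines


program = [
    Op("arith.constant", result="c32_i32", value=32),
    Op("arith.muli", result="v0", operands=["arg2", "c32_i32"]),
    Op("gpu.barrier"),
    Op("async_task", body=[Op("tt.return")]),
    Op("async_task", body=[Op("tt.store", operands=["ptr", "v0"])]),
    Op("tt.return"),
]
print("\n".join(emit(program)))
```

In this toy run the constant definition, the barrier, the terminators, and the first `async_task` region (which holds only a `tt.return`) produce no output, so only the multiply and the non-empty `async_task` block are printed.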

Pull Request resolved: facebookexperimental/triton#983

Test Plan:
1. Generated fwd.txt output using:
```
TRITON_TLX_OUTPUT_FILE=~/tritonbench/output.py \
TRITON_TLX_COMPILABLE=1 \
TRITON_DUMP_TTGIR_TO_TLX=1 \
TRITON_ALWAYS_COMPILE=1 \
TRITON_KERNEL_DUMP=1 \
TRITON_DUMP_DIR=/tmp/triton_tissue030 \
TRITON_USE_META_WS=1 \
TRITON_PRINT_AUTOTUNING=1 \
CUDA_VISIBLE_DEVICES=3 \
bash ~/fbsource/fbcode/ads_mkl/benchmarks/denoise.sh \
  python run.py --op blackwell_attentions \
  --seq-len 8192 --batch 4 --n-heads 32 --d-head 128 \
  --rep 3000 --sleep 1.0 --metrics tflops --simple-output \
  --only triton_tutorial_flash_persistent_blackwell --force
```

2. Verified:
- Constants are inlined (e.g., `mul(arg2, 32)` instead of `mul(arg2, c32_i32)`)
- No empty `with tlx.async_task():` blocks at end of output
- `gpu.barrier`, `ttg.convert_layout`, `tt.return` are not emitted

Before the change: [P2205734619](https://www.internalfb.com/phabricator/paste/view/P2205734619)
After the change: [P2208628663](https://www.internalfb.com/phabricator/paste/view/P2208628663)
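
The three checks in step 2 could be scripted against the emitted file along the following lines. This is only a sketch: the `find_problems` helper is hypothetical, and it assumes that hoisted constants follow the `c32_i32` naming pattern, that a skipped op would show up in the output under its MLIR name, and that an empty block is a `with tlx.async_task(` line with no more deeply indented line after it.

```python
import re

# Ops that should no longer appear in the emitted text (assumed to be
# recognisable by their MLIR names).
SKIPPED_OPS = ("gpu.barrier", "ttg.convert_layout", "tt.return")

# Assumed naming pattern for hoisted constants such as `c32_i32 = 32`.
CONST_DEF = re.compile(r"^\s*c\d+_i\d+\s*=", re.MULTILINE)


def find_problems(text):
    """Return a list of human-readable problems found in emitted TLX text."""
    problems = []
    if CONST_DEF.search(text):
        problems.append("constant definition not inlined")
    for op in SKIPPED_OPS:
        if op in text:
            problems.append(f"skipped op emitted: {op}")
    # Ignore blank lines so that indentation comparisons are meaningful.
    lines = [line for line in text.splitlines() if line.strip()]
    for i, line in enumerate(lines):
        if not line.strip().startswith("with tlx.async_task("):
            continue
        indent = len(line) - len(line.lstrip())
        nxt = lines[i + 1] if i + 1 < len(lines) else ""
        nxt_indent = len(nxt) - len(nxt.lstrip())
        if not nxt or nxt_indent <= indent:
            problems.append(f"empty async_task block at non-blank line {i + 1}")
    return problems
```

Calling `find_problems(open(path).read())` on the file named by `TRITON_TLX_OUTPUT_FILE` would then return an empty list when all three checks pass.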

Reviewed By: jma2333

Differential Revision: D94436902

Pulled By: tissue3

fbshipit-source-id: ddaf3e9d939b25573b2d3cac400bccae3516df44