[TLX] Improve PrintTTGIRToTLX output readability #983
Closed
tissue3 wants to merge 3 commits into facebookexperimental:main
Summary:
This change improves the TLX emission pass to produce cleaner, more readable output by:

1. **Inlining constants at use sites** - Instead of emitting `c32_i32 = 32` and referencing `c32_i32`, constants are now inlined directly as `32` where used. This removes the separate constant-definition lines from the output.
2. **Skipping non-meaningful operations** - Operations that don't contribute to TLX understanding are now filtered out:
   - `gpu.barrier` - not needed in TLX
   - `ttg.convert_layout` - internal layout conversion
   - `tt.return` / `tt.reduce.return` - terminators
   - Various warp specialization internals already skipped
3. **Skipping empty async_task blocks** - Partition regions that only contain skipped operations (like a single `tt.return`) are now omitted, eliminating empty `with tlx.async_task():` blocks.
4. **Refactored skip logic** - Replaced individual `if` statements with a `llvm::StringSet<>` lookup for cleaner, more maintainable code.

Authored with Claude.

Test Plan:
1. Generated fwd.txt output using:
```
TRITON_DUMP_TTGIR_TO_TLX=1 python tritonbench/run.py --op blackwell_attentions ...
```
2. Verified:
   - Constants are inlined (e.g., `mul(arg2, 32)` instead of `mul(arg2, c32_i32)`)
   - No empty `with tlx.async_task():` blocks at end of output
   - `gpu.barrier`, `ttg.convert_layout`, `tt.return` are not emitted

Tasks: T252981529
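The three behaviors in the summary (constant inlining, set-based skipping, and dropping empty `async_task` blocks) can be illustrated with a small Python model. This is only a sketch of the idea, not the pass itself: the real pass is C++ and uses `llvm::StringSet<>`, and the `Op` class, `OP_NAMES` table, and both `emit_*` helpers here are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for the C++ llvm::StringSet<> of skipped op names.
SKIPPED_OPS = frozenset(
    {"gpu.barrier", "ttg.convert_layout", "tt.return", "tt.reduce.return"}
)

# Hypothetical mapping from MLIR op names to the printed function name.
OP_NAMES = {"arith.muli": "mul"}


@dataclass
class Op:
    """Toy model of an operation: name, result name, operand names."""
    name: str
    result: str = ""
    operands: list = field(default_factory=list)
    value: object = None  # only set for constants


def emit_region(ops):
    """Emit one line per meaningful op, inlining constants at use sites."""
    constants = {}
    lines = []
    for op in ops:
        if op.name in SKIPPED_OPS:
            continue  # single set lookup instead of a chain of ifs
        if op.name == "arith.constant":
            constants[op.result] = str(op.value)  # remember, emit nothing
            continue
        args = ", ".join(constants.get(o, o) for o in op.operands)
        call = f"{OP_NAMES.get(op.name, op.name.split('.')[-1])}({args})"
        lines.append(f"{op.result} = {call}" if op.result else call)
    return lines


def emit_async_task(ops):
    """Emit a `with tlx.async_task():` block, or nothing if the body is empty."""
    body = emit_region(ops)
    if not body:
        return []
    return ["with tlx.async_task():"] + ["    " + line for line in body]
```

In this model, a region holding only `tt.return` produces no output at all, and a multiply by a constant prints as `var_0 = mul(arg2, 32)` with no `c32_i32 = 32` line.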
htyu pushed a commit that referenced this pull request on Mar 3, 2026
Summary:
This change improves the TLX emission pass to produce cleaner, more readable output by:

1. **Inlining constants at use sites** - Instead of emitting `c32_i32 = 32` and referencing `c32_i32`, constants are now inlined directly as `32` where used. This significantly reduces noise from constant definitions.
2. **Skipping non-meaningful operations** - Operations that don't contribute to TLX understanding are now filtered out:
   - `gpu.barrier` - not needed in TLX
   - `ttg.convert_layout` - internal layout conversion
   - `tt.return` / `tt.reduce.return` - terminators
   - Various warp specialization internals already skipped
3. **Skipping empty async_task blocks** - Partition regions that only contain skipped operations (like a single `tt.return`) are now omitted, eliminating empty `with tlx.async_task():` blocks.
4. **Refactored skip logic** - Replaced individual `if` statements with a `llvm::StringSet<>` lookup for cleaner, more maintainable code.

Pull Request resolved: #983

Test Plan:
1. Generated fwd.txt output using:
```
TRITON_TLX_OUTPUT_FILE=~/tritonbench/output.py \
TRITON_TLX_COMPILABLE=1 \
TRITON_DUMP_TTGIR_TO_TLX=1 \
TRITON_ALWAYS_COMPILE=1 \
TRITON_KERNEL_DUMP=1 \
TRITON_DUMP_DIR=/tmp/triton_tissue030 \
TRITON_USE_META_WS=1 \
TRITON_PRINT_AUTOTUNING=1 \
CUDA_VISIBLE_DEVICES=3 \
bash ~/fbsource/fbcode/ads_mkl/benchmarks/denoise.sh \
python run.py --op blackwell_attentions --seq-len 8192 --batch 4 --n-heads 32 --d-head 128 --rep 3000 --sleep 1.0 --metrics tflops --simple-output --only triton_tutorial_flash_persistent_blackwell --force
```
2. Verified:
   - Constants are inlined (e.g., `mul(arg2, 32)` instead of `mul(arg2, c32_i32)`)
   - No empty `with tlx.async_task():` blocks at end of output
   - `gpu.barrier`, `ttg.convert_layout`, `tt.return` are not emitted

Before the change: [P2205734619](https://www.internalfb.com/phabricator/paste/view/P2205734619)
After the change: [P2208628663](https://www.internalfb.com/phabricator/paste/view/P2208628663)

Reviewed By: jma2333
Differential Revision: D94436902
Pulled By: tissue3
fbshipit-source-id: ddaf3e9d939b25573b2d3cac400bccae3516df44
meta-codesync Bot pushed a commit that referenced this pull request on Mar 5, 2026
Summary:
Follow-up to #983. Further improves the `PrintTTGIRToTLX` pass to produce more Python-like output:

- **Comparison inlining**: `cmpi`/`cmpf` printed as `<`, `>=`, `==` etc. and inlined into `if`/`while` conditions (e.g., `if var_109 < var_105:`)
- **Cast transparency**: Type casts (`extui`, `trunci`, `truncf`, `sitofp`, `bitcast`, etc.) resolved through to source operand and skipped in output
- **Structured control flow**: `cf.br`/`cf.cond_br` converted to `if`/`else`, `while`, and `for var in tl.range(start, end[, step]):` loops
- **TMA op mappings**: Added `ttng.tensormap_create`, `ttg.global_scratch_alloc`, `ttng.tensormap_fenceproxy_acquire`

Authored with Claude.

Pull Request resolved: #1005

Test Plan:
1. Build: `cd ~/triton; make`
2. Generate output with `TRITON_TLX_OUTPUT_FILE=~/tritonbench/output.py TRITON_TLX_COMPILABLE=1 TRITON_DUMP_TTGIR_TO_TLX=1 TRITON_ALWAYS_COMPILE=1 TRITON_KERNEL_DUMP=1 TRITON_DUMP_DIR=/tmp/triton_tissue030 TRITON_USE_META_WS=1 TRITON_PRINT_AUTOTUNING=1 CUDA_VISIBLE_DEVICES=3 bash ~/fbsource/fbcode/ads_mkl/benchmarks/denoise.sh python run.py --op blackwell_attentions --seq-len 8192 --batch 4 --n-heads 32 --d-head 128 --rep 3000 --sleep 1.0 --metrics tflops --simple-output --only triton_tutorial_flash_persistent_blackwell --force 2>&1 | tee fwd.txt`
3. Generated fwd.txt: [P2209959758](https://www.internalfb.com/phabricator/paste/view/P2209959758)

Reviewed By: prithvip0524
Differential Revision: D94788464
Pulled By: tissue3
fbshipit-source-id: 46efb5524daf465554918f430d46fd71c02e7f7d
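As a rough illustration of the comparison inlining and cast transparency described in this follow-up, the Python sketch below resolves a condition value back through cast ops and prints an integer comparison as an infix operator. The `defs` table layout, `CAST_OPS`, `CMPI_SYMBOLS`, and the helper names are assumptions made for this example; the actual pass is written in C++ and may be organized differently.

```python
# Cast ops treated as transparent (subset named in the commit message).
CAST_OPS = frozenset(
    {"arith.extui", "arith.trunci", "arith.truncf", "arith.sitofp", "arith.bitcast"}
)

# Signed/equality arith.cmpi predicates mapped to Python operators.
CMPI_SYMBOLS = {
    "slt": "<", "sle": "<=", "sgt": ">", "sge": ">=", "eq": "==", "ne": "!=",
}


def resolve(name, defs):
    """Return the printable expression for a value name.

    `defs` is a hypothetical table: value name -> (op name, operands, predicate).
    Casts are followed to their source operand; comparisons become infix text.
    """
    while name in defs:
        op, operands, predicate = defs[name]
        if op in CAST_OPS:
            name = operands[0]  # look through the cast
        elif op == "arith.cmpi":
            lhs, rhs = (resolve(o, defs) for o in operands)
            return f"{lhs} {CMPI_SYMBOLS[predicate]} {rhs}"
        else:
            break  # any other op keeps its own variable name
    return name


def emit_if(cond, defs):
    """Print an `if` header with the comparison inlined into the condition."""
    return f"if {resolve(cond, defs)}:"
```

With a table in which `var_110` is an `extui` of `var_109` and `var_111` compares `var_110` against `var_105` with `slt`, `emit_if("var_111", defs)` yields the form quoted in the summary, `if var_109 < var_105:`.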
htyu pushed a commit that referenced this pull request on Mar 5, 2026
meta-codesync Bot pushed a commit that referenced this pull request on Mar 14, 2026
… pass

Summary:
The PrintTTGIRToTLX debug pass mapped `ttg.memdesc_index` to `tlx.memdesc_index`, but the correct TLX Python API name is `tlx.local_view`. This change aligns the emitted pseudocode with the actual TLX DSL, making the output closer to compilable TLX Python.

This is the rebased remainder of D94700558 after [PR #983](#983) and [PR #1005](#1005) landed upstream. Those PRs already covered 6 of 7 changes (math op mappings, unsigned div/rem, NaN-propagating min/max, skipping gpu.barrier/convert_layout, convert layout transparent substitution via getValueName, TMA descriptor, and constant deduplication, which is obsolete as #983 inlines constants at use sites).

Rebased and authored with Claude.

Note that the original intent of D94700558 was to ensure that TLX generated by the TTGIR-to-TLX pass could compile, and it has not achieved this yet. This PR only brings us up to speed with the earlier WIP; additional fixes toward compilable TLX Python will be posted on top of it.

Differential Revision: D96554961
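The rename amounts to changing one entry in an op-name table. A minimal Python sketch, assuming a dictionary-based mapping (the `OP_TO_TLX` table and `tlx_name` helper are hypothetical; only the two op names come from the commit message):

```python
# Hypothetical op-name table; the fix changes the target of this one entry
# from "tlx.memdesc_index" to the real TLX API name.
OP_TO_TLX = {
    "ttg.memdesc_index": "tlx.local_view",
}


def tlx_name(op_name):
    """Return the TLX spelling of an op, or the op name itself if unmapped."""
    return OP_TO_TLX.get(op_name, op_name)
```

Falling back to the original op name for unmapped ops is a choice made for this sketch, not something the commit message specifies.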