[TLX] Improve PrintTTGIRToTLX output readability by tissue3 · Pull Request #983 · facebookexperimental/triton

tissue3 · 2026-02-26T00:04:59Z

Summary:
This change improves the TLX emission pass to produce cleaner, more readable output by:

1. **Inlining constants at use sites** - Instead of emitting `c32_i32 = 32` and referencing `c32_i32`, constants are now inlined directly as `32` where used. This significantly reduces noise from constant definitions.
2. **Skipping non-meaningful operations** - Operations that don't contribute to TLX understanding are now filtered out:
   - `gpu.barrier` - not needed in TLX
   - `ttg.convert_layout` - internal layout conversion
   - `tt.return` / `tt.reduce.return` - terminators
   - Various warp specialization internals already skipped
3. **Skipping empty async_task blocks** - Partition regions that only contain skipped operations (like a single `tt.return`) are now omitted, eliminating empty `with tlx.async_task():` blocks.
4. **Refactored skip logic** - Replaced individual `if` statements with an `llvm::StringSet<>` lookup for cleaner, more maintainable code.
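
For readers who want a feel for the four changes listed above, here is a minimal Python sketch of the same ideas. It is not the pass itself (which is C++ and uses `llvm::StringSet<>`); the `Op` class, `PRINT_NAMES`, `emit_region`, and `emit_async_tasks` are hypothetical names invented for illustration, and the skip set only contains the ops named in the summary.

```python
from dataclasses import dataclass, field

# Ops named in the PR summary; the real pass also skips warp specialization internals.
SKIPPED_OPS = frozenset({
    "gpu.barrier",
    "ttg.convert_layout",
    "tt.return",
    "tt.reduce.return",
})

# Hypothetical op-name-to-printed-name table, just enough for the example.
PRINT_NAMES = {"arith.muli": "mul"}


@dataclass
class Op:
    """Toy stand-in for an IR operation (assumption, not the MLIR API)."""
    name: str
    result: str = ""
    operands: list = field(default_factory=list)
    value: object = None  # literal payload, used only by arith.constant


def emit_region(ops):
    """Emit one line per printable op, inlining constants at their use sites."""
    constants = {}
    lines = []
    for op in ops:
        if op.name in SKIPPED_OPS:  # single set lookup instead of chained ifs
            continue
        if op.name == "arith.constant":
            constants[op.result] = repr(op.value)  # remember the literal, print nothing
            continue
        args = ", ".join(constants.get(a, a) for a in op.operands)
        call = f"{PRINT_NAMES.get(op.name, op.name)}({args})"
        lines.append(f"{op.result} = {call}" if op.result else call)
    return lines


def emit_async_tasks(partitions):
    """Emit a `with tlx.async_task():` block only for partitions that print something."""
    out = []
    for ops in partitions:
        body = emit_region(ops)
        if not body:  # e.g. a partition holding a single tt.return
            continue
        out.append("with tlx.async_task():")
        out.extend("    " + line for line in body)
    return out
```

With a constant `c32_i32 = 32` feeding a multiply, `emit_region` in this sketch yields `var_0 = mul(arg2, 32)`, and a partition containing only `tt.return` produces no `with tlx.async_task():` header at all.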

Test Plan:

Generated fwd.txt output using:

 WITH_OSS_WARPSPEC=1 TRITON_USE_META_PARTITION=1 TRITON_TLX_COMPILABLE=1 TRITON_DUMP_TTGIR_TO_TLX=1 TRITON_ALWAYS_COMPILE=1 TRITON_KERNEL_DUMP=1 TRITON_DUMP_DIR=/tmp/triton_tissue030 TRITON_USE_META_WS=1 TRITON_PRINT_AUTOTUNING=1 CUDA_VISIBLE_DEVICES=3 bash ~/fbsource/fbcode/ads_mkl/benchmarks/denoise.sh python run.py --op blackwell_attentions --seq-len 8192 --batch 4 --n-heads 32 --d-head 128 --rep 3000 --sleep 1.0 --metrics tflops --simple-output --only triton_tutorial_flash_persistent_blackwell --force

Verified:

- Constants are inlined (e.g., `mul(arg2, 32)` instead of `mul(arg2, c32_i32)`)
- No empty `with tlx.async_task():` blocks at end of output
- `gpu.barrier`, `ttg.convert_layout`, `tt.return` are not emitted

Before the change: P2205734619
After the change: P2208628663
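
The three checks in the Verified list could also be scripted against the dumped text. The sketch below is one possible way to do that, not part of the PR: `check_output` is a hypothetical helper, and the regular expression assumes constant names always look like `c32_i32` (the only example the description gives).

```python
import re

# Assumption: leftover constant names follow the c<value>_i<width> shape of c32_i32.
CONST_NAME = re.compile(r"\bc\d+_i\d+\b")
SKIPPED_OPS = ("gpu.barrier", "ttg.convert_layout", "tt.return")
TASK_HEADER = "with tlx.async_task():"


def _indent(line):
    """Number of leading whitespace characters."""
    return len(line) - len(line.lstrip())


def check_output(text):
    """Return a list of human-readable problems found in emitted TLX text."""
    problems = []
    # Keep original 1-based line numbers, but ignore blank lines.
    lines = [(no, l) for no, l in enumerate(text.splitlines(), 1) if l.strip()]
    for idx, (no, line) in enumerate(lines):
        if CONST_NAME.search(line):
            problems.append(f"line {no}: constant not inlined")
        for op in SKIPPED_OPS:
            if op in line:
                problems.append(f"line {no}: {op} should not be emitted")
        if line.strip() == TASK_HEADER:
            nxt = lines[idx + 1][1] if idx + 1 < len(lines) else None
            # A block is empty when nothing more deeply indented follows the header.
            if nxt is None or _indent(nxt) <= _indent(line):
                problems.append(f"line {no}: empty async_task block")
    return problems
```

Running `check_output(open("fwd.txt").read())` on a dump would return an empty list when all three properties hold, under the naming assumption stated above.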

Summary:
This change improves the TLX emission pass to produce cleaner, more readable output by:

1. **Inlining constants at use sites** - Instead of emitting `c32_i32 = 32` and referencing `c32_i32`, constants are now inlined directly as `32` where used. This significantly reduces noise from constant definitions.
2. **Skipping non-meaningful operations** - Operations that don't contribute to TLX understanding are now filtered out:
   - `gpu.barrier` - not needed in TLX
   - `ttg.convert_layout` - internal layout conversion
   - `tt.return` / `tt.reduce.return` - terminators
   - Various warp specialization internals already skipped
3. **Skipping empty async_task blocks** - Partition regions that only contain skipped operations (like a single `tt.return`) are now omitted, eliminating empty `with tlx.async_task():` blocks.
4. **Refactored skip logic** - Replaced individual `if` statements with a `llvm::StringSet<>` lookup for cleaner, more maintainable code.

Authored with Claude.

Test Plan:

1. Generated fwd.txt output using:
   ```
   TRITON_DUMP_TTGIR_TO_TLX=1 python tritonbench/run.py --op blackwell_attentions ...
   ```
2. Verified:
   - Constants are inlined (e.g., `mul(arg2, 32)` instead of `mul(arg2, c32_i32)`)
   - No empty `with tlx.async_task():` blocks at end of output
   - `gpu.barrier`, `ttg.convert_layout`, `tt.return` are not emitted

Tasks: T252981529

meta-codesync · 2026-02-26T00:05:27Z

@tissue3 has imported this pull request. If you are a Meta employee, you can view this in D94436902.

manman-ren

LGTM! Thanks!

meta-codesync · 2026-02-28T11:37:46Z

@tissue3 merged this pull request in 4867b62.

Summary:
This change improves the TLX emission pass to produce cleaner, more readable output by:

1. **Inlining constants at use sites** - Instead of emitting `c32_i32 = 32` and referencing `c32_i32`, constants are now inlined directly as `32` where used. This significantly reduces noise from constant definitions.
2. **Skipping non-meaningful operations** - Operations that don't contribute to TLX understanding are now filtered out:
   - `gpu.barrier` - not needed in TLX
   - `ttg.convert_layout` - internal layout conversion
   - `tt.return` / `tt.reduce.return` - terminators
   - Various warp specialization internals already skipped
3. **Skipping empty async_task blocks** - Partition regions that only contain skipped operations (like a single `tt.return`) are now omitted, eliminating empty `with tlx.async_task():` blocks.
4. **Refactored skip logic** - Replaced individual `if` statements with a `llvm::StringSet<>` lookup for cleaner, more maintainable code.

Pull Request resolved: #983

Test Plan:

1. Generated fwd.txt output using:
   ```
   TRITON_TLX_OUTPUT_FILE=~/tritonbench/output.py TRITON_TLX_COMPILABLE=1 TRITON_DUMP_TTGIR_TO_TLX=1 TRITON_ALWAYS_COMPILE=1 TRITON_KERNEL_DUMP=1 TRITON_DUMP_DIR=/tmp/triton_tissue030 TRITON_USE_META_WS=1 TRITON_PRINT_AUTOTUNING=1 CUDA_VISIBLE_DEVICES=3 bash ~/fbsource/fbcode/ads_mkl/benchmarks/denoise.sh python run.py --op blackwell_attentions --seq-len 8192 --batch 4 --n-heads 32 --d-head 128 --rep 3000 --sleep 1.0 --metrics tflops --simple-output --only triton_tutorial_flash_persistent_blackwell --force
   ```
2. Verified:
   - Constants are inlined (e.g., `mul(arg2, 32)` instead of `mul(arg2, c32_i32)`)
   - No empty `with tlx.async_task():` blocks at end of output
   - `gpu.barrier`, `ttg.convert_layout`, `tt.return` are not emitted

Before the change [P2205734619](https://www.internalfb.com/phabricator/paste/view/P2205734619)
After the change [P2208628663](https://www.internalfb.com/phabricator/paste/view/P2208628663)

Reviewed By: jma2333

Differential Revision: D94436902

Pulled By: tissue3

fbshipit-source-id: ddaf3e9d939b25573b2d3cac400bccae3516df44

Summary:
Follow-up to #983. Further improves the `PrintTTGIRToTLX` pass to produce more Python-like output:

- **Comparison inlining**: `cmpi`/`cmpf` printed as `<`, `>=`, `==` etc. and inlined into `if`/`while` conditions (e.g., `if var_109 < var_105:`)
- **Cast transparency**: Type casts (`extui`, `trunci`, `truncf`, `sitofp`, `bitcast`, etc.) resolved through to source operand and skipped in output
- **Structured control flow**: `cf.br`/`cf.cond_br` converted to `if`/`else`, `while`, and `for var in tl.range(start, end[, step]):` loops
- **TMA op mappings**: Added `ttng.tensormap_create`, `ttg.global_scratch_alloc`, `ttng.tensormap_fenceproxy_acquire`

Authored with Claude.

Pull Request resolved: #1005

Test Plan:

1. Build: `cd ~/triton; make`
2. Generate output with `TRITON_TLX_OUTPUT_FILE=~/tritonbench/output.py TRITON_TLX_COMPILABLE=1 TRITON_DUMP_TTGIR_TO_TLX=1 TRITON_ALWAYS_COMPILE=1 TRITON_KERNEL_DUMP=1 TRITON_DUMP_DIR=/tmp/triton_tissue030 TRITON_USE_META_WS=1 TRITON_PRINT_AUTOTUNING=1 CUDA_VISIBLE_DEVICES=3 bash ~/fbsource/fbcode/ads_mkl/benchmarks/denoise.sh python run.py --op blackwell_attentions --seq-len 8192 --batch 4 --n-heads 32 --d-head 128 --rep 3000 --sleep 1.0 --metrics tflops --simple-output --only triton_tutorial_flash_persistent_blackwell --force 2>&1 | tee fwd.txt`
3. Generated fwd.txt: [P2209959758](https://www.internalfb.com/phabricator/paste/view/P2209959758)

Reviewed By: prithvip0524

Differential Revision: D94788464

Pulled By: tissue3

fbshipit-source-id: 46efb5524daf465554918f430d46fd71c02e7f7d
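
The comparison inlining and cast transparency described in this follow-up can be illustrated with a small Python sketch. It is an assumption-laden toy, not the pass: `defs` is a hypothetical map from a result name to `(op, operands)`, `resolve` and `render_condition` are made-up helpers, and only the signed integer predicates of `cmpi` are covered.

```python
# Signed/equality predicates of arith.cmpi mapped to Python operators.
# Unsigned and floating-point predicates are left out of this sketch.
CMP_SYMBOLS = {
    "eq": "==", "ne": "!=",
    "slt": "<", "sle": "<=",
    "sgt": ">", "sge": ">=",
}

# Cast ops listed in the summary; they are looked through rather than printed.
CAST_OPS = frozenset({"extui", "trunci", "truncf", "sitofp", "bitcast"})


def resolve(name, defs):
    """Follow a chain of casts back to the value that was originally cast."""
    while name in defs and defs[name][0] in CAST_OPS:
        name = defs[name][1][0]
    return name


def render_condition(name, defs):
    """Render a condition value, inlining a defining cmpi as an infix comparison."""
    name = resolve(name, defs)
    op = defs.get(name)
    if op is not None and op[0] == "cmpi":
        pred, lhs, rhs = op[1]
        return f"{resolve(lhs, defs)} {CMP_SYMBOLS[pred]} {resolve(rhs, defs)}"
    return name  # not a comparison: fall back to the plain name


# Hypothetical definitions: var_110 = cmpi slt, var_109, var_105
defs = {"var_110": ("cmpi", ("slt", "var_109", "var_105"))}
print(f"if {render_condition('var_110', defs)}:")
```

Here the condition is rendered at the branch rather than as a separate assignment, which matches the `if var_109 < var_105:` shape quoted in the summary.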

… pass

Summary:
The PrintTTGIRToTLX debug pass mapped `ttg.memdesc_index` to `tlx.memdesc_index`, but the correct TLX Python API name is `tlx.local_view`. This change aligns the emitted pseudocode with the actual TLX DSL, making the output closer to compilable TLX Python.

This is the rebased remainder of D94700558 after [PR #983](#983) and [PR #1005](#1005) landed upstream. Those PRs already covered 6 of 7 changes (math op mappings, unsigned div/rem, NaN-propagating min/max, skipping gpu.barrier/convert_layout, convert layout transparent substitution via getValueName, TMA descriptor, constant deduplication--obsolete as #983 inlines constants at use sites).

Rebased and authored with Claude.

Note that while the original intent of D94700558 was to ensure that generated TLX from the TTGIR-to-TLX pass could compile, it did not achieve this yet. This PR is to just get us up to speed with the earlier WIP, and then we'll post additional fixes on top of this towards compilable TLX Python.

Differential Revision: D96554961

meta-cla Bot added the CLA Signed label Feb 26, 2026

manman-ren approved these changes Feb 27, 2026

tissue3 mentioned this pull request Feb 27, 2026

[TLX] Improve PrintTTGIRToTLX on Control Flow #1005

Closed

Merge branch 'main' into tlx_print_inline_constant

9963692

meta-codesync Bot closed this in 4867b62 Feb 28, 2026

facebook-github-tools Bot added the Merged label Feb 28, 2026

jdonald mentioned this pull request Mar 14, 2026

[triton][tlx] Map ttg.memdesc_index to tlx.local_view in TTGIR-to-TLX pass #1087

Open

tissue3 wants to merge 3 commits into facebookexperimental:main from tissue3:tlx_print_inline_constant

2 participants
