
[TLX] Improve PrintTTGIRToTLX on Control Flow#1005

Closed
tissue3 wants to merge 4 commits into facebookexperimental:main from tissue3:tlx_dump_v2

Conversation


@tissue3 tissue3 commented Feb 27, 2026

Summary

Follow-up to #983. Further improves the PrintTTGIRToTLX pass to produce more Python-like output:

  • Comparison inlining: cmpi/cmpf printed as <, >=, == etc. and inlined into if/while conditions (e.g., if var_109 < var_105:)
  • Cast transparency: Type casts (extui, trunci, truncf, sitofp, bitcast, etc.) resolved through to source operand and skipped in output
  • Structured control flow: cf.br/cf.cond_br converted to if/else, while, and for var in tl.range(start, end[, step]): loops
  • TMA op mappings: Added ttng.tensormap_create, ttg.global_scratch_alloc, ttng.tensormap_fenceproxy_acquire
    Authored with Claude.
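
The comparison-inlining and cast-transparency behavior described above can be sketched as follows. This is an illustrative Python model, not the actual pass (which is C++/MLIR); the names `PREDICATE_SYMBOLS`, `TRANSPARENT_CASTS`, `resolve_through_casts`, and `emit_condition` are invented for this sketch.

```python
# Hypothetical sketch of comparison inlining and cast transparency.
# The real pass is C++; this models the same idea in Python.

# Map MLIR cmpi/cmpf predicates to Python comparison operators.
PREDICATE_SYMBOLS = {
    "slt": "<", "ult": "<", "olt": "<",
    "sge": ">=", "uge": ">=", "oge": ">=",
    "eq": "==", "oeq": "==",
    "ne": "!=", "une": "!=",
}

# Casts skipped in the output; uses resolve through to the cast's source.
TRANSPARENT_CASTS = {"extui", "extsi", "trunci", "truncf", "sitofp", "bitcast"}

def resolve_through_casts(value, producers):
    """Follow a value back through transparent casts to its source operand.

    `producers` maps a value name to (op_name, operand_names).
    """
    while value in producers:
        op_name, operands = producers[value]
        if op_name not in TRANSPARENT_CASTS:
            break
        value = operands[0]
    return value

def emit_condition(predicate, lhs, rhs, producers):
    """Render a cmpi/cmpf as an inline Python comparison for if/while."""
    lhs = resolve_through_casts(lhs, producers)
    rhs = resolve_through_casts(rhs, producers)
    return f"{lhs} {PREDICATE_SYMBOLS[predicate]} {rhs}"
```

With a producer map recording that `var_110 = extui(var_109)`, a `cmpi slt` on `var_110` and `var_105` would render as `var_109 < var_105`, matching the example in the summary.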

Test Plan

  1. Build: cd ~/triton; make
  2. Generate output with TRITON_TLX_OUTPUT_FILE=~/tritonbench/output.py TRITON_TLX_COMPILABLE=1 TRITON_DUMP_TTGIR_TO_TLX=1 TRITON_ALWAYS_COMPILE=1 TRITON_KERNEL_DUMP=1 TRITON_DUMP_DIR=/tmp/triton_tissue030 TRITON_USE_META_WS=1 TRITON_PRINT_AUTOTUNING=1 CUDA_VISIBLE_DEVICES=3 bash ~/fbsource/fbcode/ads_mkl/benchmarks/denoise.sh python run.py --op blackwell_attentions --seq-len 8192 --batch 4 --n-heads 32 --d-head 128 --rep 3000 --sleep 1.0 --metrics tflops --simple-output --only triton_tutorial_flash_persistent_blackwell --force 2>&1 | tee fwd.txt
  3. Generated fwd.txt: P2209959758

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Feb 27, 2026
@tissue3 tissue3 changed the title [TLX] Improve PrintTTGIRToTLX: comparisons, casts, control flow [TLX] Improve PrintTTGIRToTLX on Control Flow Feb 27, 2026
@tissue3 tissue3 closed this Feb 27, 2026
@tissue3 tissue3 reopened this Feb 27, 2026

tissue3 commented Feb 27, 2026

Need to review after #983 lands.

Summary:
This change improves the TLX emission pass to produce cleaner, more readable output by:

1. **Inlining constants at use sites** - Instead of emitting `c32_i32 = 32` and referencing `c32_i32`, constants are now inlined directly as `32` where used. This significantly reduces noise from constant definitions.

2. **Skipping non-meaningful operations** - Operations that don't contribute to TLX understanding are now filtered out:
   - `gpu.barrier` - not needed in TLX
   - `ttg.convert_layout` - internal layout conversion
   - `tt.return` / `tt.reduce.return` - terminators
   - Various warp specialization internals already skipped

3. **Skipping empty async_task blocks** - Partition regions that only contain skipped operations (like a single `tt.return`) are now omitted, eliminating empty `with tlx.async_task():` blocks.

4. **Refactored skip logic** - Replaced individual `if` statements with a `llvm::StringSet<>` lookup for cleaner, more maintainable code.

Authored with Claude.
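
The skip-set lookup and constant inlining described in items 1, 2, and 4 can be sketched like this. The real pass uses a C++ `llvm::StringSet<>`; this Python model with a plain `set` and invented helper names (`render_operand`, `emit_op`) is only an illustration.

```python
# Illustrative sketch (not the actual C++ pass) of the skip-set lookup
# and constant inlining at use sites.

SKIPPED_OPS = {
    "gpu.barrier",         # not needed in TLX
    "ttg.convert_layout",  # internal layout conversion
    "tt.return",           # terminator
    "tt.reduce.return",    # terminator
}

def render_operand(name, constants):
    """Inline a constant at its use site instead of referencing c32_i32-style names."""
    return str(constants[name]) if name in constants else name

def emit_op(op_name, result, operands, constants):
    """Return a TLX-style line for an op, or None if the op is skipped."""
    if op_name in SKIPPED_OPS:
        return None
    args = ", ".join(render_operand(o, constants) for o in operands)
    return f"{result} = {op_name.split('.')[-1]}({args})"
```

Under this model, `arith.muli(arg2, c32_i32)` with `c32_i32 = 32` recorded in the constant map emits `muli(arg2, 32)`, and a `gpu.barrier` emits nothing; a partition region whose ops all emit `None` is the "empty async_task block" case that item 3 drops.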

Test Plan:
1. Generated fwd.txt output using:
```
TRITON_DUMP_TTGIR_TO_TLX=1 python tritonbench/run.py --op blackwell_attentions ...
```

2. Verified:
- Constants are inlined (e.g., `mul(arg2, 32)` instead of `mul(arg2, c32_i32)`)
- No empty `with tlx.async_task():` blocks at end of output
- `gpu.barrier`, `ttg.convert_layout`, `tt.return` are not emitted

Tasks: T252981529

meta-codesync Bot commented Feb 28, 2026

@tissue3 has imported this pull request. If you are a Meta employee, you can view this in D94788464.


@manman-ren manman-ren left a comment


LGTM! Thanks!


meta-codesync Bot commented Mar 5, 2026

@tissue3 merged this pull request in e3fc12a.

htyu pushed a commit that referenced this pull request Mar 5, 2026
Summary:
Follow-up to #983. Further improves the `PrintTTGIRToTLX` pass to produce more Python-like output:
- **Comparison inlining**: `cmpi`/`cmpf` printed as `<`, `>=`, `==` etc. and inlined into `if`/`while` conditions (e.g., `if var_109 < var_105:`)
- **Cast transparency**: Type casts (`extui`, `trunci`, `truncf`, `sitofp`, `bitcast`, etc.) resolved through to source operand and skipped in output
- **Structured control flow**: `cf.br`/`cf.cond_br` converted to `if`/`else`, `while`, and `for var in tl.range(start, end[, step]):` loops
- **TMA op mappings**: Added `ttng.tensormap_create`, `ttg.global_scratch_alloc`, `ttng.tensormap_fenceproxy_acquire`
Authored with Claude.
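
The `cf.br`/`cf.cond_br` structuring can be sketched roughly as follows; the block and branch representations here (`loop_headers`, `structure_branch`, `emit_for`) are invented for illustration and are not the pass's actual data structures.

```python
# Hedged sketch of recovering structured control flow from cf.cond_br:
# a conditional branch whose target is a loop header becomes `while`,
# otherwise `if`; a counted induction-variable pattern becomes tl.range.

def structure_branch(cond, true_target, loop_headers):
    """Emit the Python header line for a cf.cond_br."""
    if true_target in loop_headers:
        return f"while {cond}:"
    return f"if {cond}:"

def emit_for(var, start, end, step=None):
    """Emit a counted loop recovered from an induction-variable pattern."""
    args = f"{start}, {end}" + (f", {step}" if step is not None else "")
    return f"for {var} in tl.range({args}):"
```

So a branch back to a loop header on `var_109 < var_105` prints `while var_109 < var_105:`, and a recognized counted loop prints `for var_7 in tl.range(0, var_5, 32):`, matching the `for var in tl.range(start, end[, step]):` shape described above.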

Pull Request resolved: #1005

Test Plan:
1. Build: `cd ~/triton; make`
2. Generate output with `TRITON_TLX_OUTPUT_FILE=~/tritonbench/output.py TRITON_TLX_COMPILABLE=1 TRITON_DUMP_TTGIR_TO_TLX=1 TRITON_ALWAYS_COMPILE=1 TRITON_KERNEL_DUMP=1 TRITON_DUMP_DIR=/tmp/triton_tissue030 TRITON_USE_META_WS=1 TRITON_PRINT_AUTOTUNING=1 CUDA_VISIBLE_DEVICES=3 bash ~/fbsource/fbcode/ads_mkl/benchmarks/denoise.sh python run.py --op blackwell_attentions --seq-len 8192 --batch 4 --n-heads 32 --d-head 128 --rep 3000 --sleep 1.0 --metrics tflops --simple-output --only triton_tutorial_flash_persistent_blackwell --force 2>&1 | tee fwd.txt`
3. Generated fwd.txt: [P2209959758](https://www.internalfb.com/phabricator/paste/view/P2209959758)

Reviewed By: prithvip0524

Differential Revision: D94788464

Pulled By: tissue3

fbshipit-source-id: 46efb5524daf465554918f430d46fd71c02e7f7d
meta-codesync Bot pushed a commit that referenced this pull request Mar 14, 2026
… pass

Summary:
The PrintTTGIRToTLX debug pass mapped `ttg.memdesc_index` to
`tlx.memdesc_index`, but the correct TLX Python API name is
`tlx.local_view`. This change aligns the emitted pseudocode with the
actual TLX DSL, making the output closer to compilable TLX Python.
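
The fix amounts to one entry in the pass's TTGIR-to-TLX op-name mapping; a minimal Python sketch of that table (the dict name and helper are illustrative, and only the `memdesc_index` entry is taken from this change):

```python
# Minimal sketch of the op-name mapping this fix adjusts.
TTGIR_TO_TLX = {
    # Was emitted as tlx.memdesc_index, which is not a real TLX API name.
    "ttg.memdesc_index": "tlx.local_view",
}

def tlx_name(op_name):
    """Map a TTGIR op name to its TLX Python spelling, defaulting to itself."""
    return TTGIR_TO_TLX.get(op_name, op_name)
```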

This is the rebased remainder of D94700558 after [PR #983](#983) and [PR #1005](#1005)
landed upstream. Those PRs already covered 6 of the 7 changes: math op
mappings, unsigned div/rem, NaN-propagating min/max, skipping
gpu.barrier/convert_layout, transparent convert_layout substitution via
getValueName, and the TMA descriptor. The seventh, constant deduplication,
is now obsolete since #983 inlines constants at use sites.

Rebased and authored with Claude.

Note that while the original intent of D94700558 was to ensure that generated
TLX from the TTGIR-to-TLX pass could compile, it did not achieve this yet. This
PR is to just get us up to speed with the earlier WIP, and then we'll post additional
fixes on top of this towards compilable TLX Python.

Differential Revision: D96554961
