
[TLX] Improve PrintTTGIRToTLX on Control Flow#1005

Closed
tissue3 wants to merge 4 commits into facebookexperimental:main from tissue3:tlx_dump_v2

Conversation


@tissue3 tissue3 commented Feb 27, 2026

Summary

Follow-up to #983. Further improves the PrintTTGIRToTLX pass to produce more Python-like output:

  • Comparison inlining: cmpi/cmpf printed as <, >=, == etc. and inlined into if/while conditions (e.g., if var_109 < var_105:)
  • Cast transparency: Type casts (extui, trunci, truncf, sitofp, bitcast, etc.) resolved through to source operand and skipped in output
  • Structured control flow: cf.br/cf.cond_br converted to if/else, while, and for var in tl.range(start, end[, step]): loops
  • TMA op mappings: Added ttng.tensormap_create, ttg.global_scratch_alloc, ttng.tensormap_fenceproxy_acquire
    Authored with Claude.
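
The comparison-inlining and cast-transparency behavior described above can be sketched as follows. This is an illustrative Python model, not the actual pass (which is C++/MLIR); the names `PREDICATE_SYMBOLS`, `TRANSPARENT_CASTS`, `resolve_through_casts`, and `emit_condition` are invented for this sketch.

```python
# Hypothetical sketch of comparison inlining and cast transparency.
# The real pass is C++; this models the same idea in Python.

# Map MLIR cmpi/cmpf predicates to Python comparison operators.
PREDICATE_SYMBOLS = {
    "slt": "<", "ult": "<", "olt": "<",
    "sge": ">=", "uge": ">=", "oge": ">=",
    "eq": "==", "oeq": "==",
    "ne": "!=", "une": "!=",
}

# Casts skipped in the output; uses resolve through to the cast's source.
TRANSPARENT_CASTS = {"extui", "extsi", "trunci", "truncf", "sitofp", "bitcast"}

def resolve_through_casts(value, producers):
    """Follow a value back through transparent casts to its source operand.

    `producers` maps a value name to (op_name, operand_names).
    """
    while value in producers:
        op_name, operands = producers[value]
        if op_name not in TRANSPARENT_CASTS:
            break
        value = operands[0]
    return value

def emit_condition(predicate, lhs, rhs, producers):
    """Render a cmpi/cmpf as an inline Python comparison for if/while."""
    lhs = resolve_through_casts(lhs, producers)
    rhs = resolve_through_casts(rhs, producers)
    return f"{lhs} {PREDICATE_SYMBOLS[predicate]} {rhs}"
```

With a producer map recording that `var_110 = extui(var_109)`, a `cmpi slt` on `var_110` and `var_105` would render as `var_109 < var_105`, matching the example in the summary.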

Test Plan

  1. Build: cd ~/triton; make
  2. Generate output with TRITON_TLX_OUTPUT_FILE=~/tritonbench/output.py TRITON_TLX_COMPILABLE=1 TRITON_DUMP_TTGIR_TO_TLX=1 TRITON_ALWAYS_COMPILE=1 TRITON_KERNEL_DUMP=1 TRITON_DUMP_DIR=/tmp/triton_tissue030 TRITON_USE_META_WS=1 TRITON_PRINT_AUTOTUNING=1 CUDA_VISIBLE_DEVICES=3 bash ~/fbsource/fbcode/ads_mkl/benchmarks/denoise.sh python run.py --op blackwell_attentions --seq-len 8192 --batch 4 --n-heads 32 --d-head 128 --rep 3000 --sleep 1.0 --metrics tflops --simple-output --only triton_tutorial_flash_persistent_blackwell --force 2>&1 | tee fwd.txt
  3. Generated fwd.txt: P2209959758

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Feb 27, 2026
@tissue3 tissue3 changed the title [TLX] Improve PrintTTGIRToTLX: comparisons, casts, control flow [TLX] Improve PrintTTGIRToTLX on Control Flow Feb 27, 2026
@tissue3 tissue3 closed this Feb 27, 2026
@tissue3 tissue3 reopened this Feb 27, 2026

tissue3 commented Feb 27, 2026

Need to review after #983 lands.

Summary:
This change improves the TLX emission pass to produce cleaner, more readable output by:

1. **Inlining constants at use sites** - Instead of emitting `c32_i32 = 32` and referencing `c32_i32`, constants are now inlined directly as `32` where used. This significantly reduces noise from constant definitions.

2. **Skipping non-meaningful operations** - Operations that don't contribute to TLX understanding are now filtered out:
   - `gpu.barrier` - not needed in TLX
   - `ttg.convert_layout` - internal layout conversion
   - `tt.return` / `tt.reduce.return` - terminators
   - Various warp specialization internals already skipped

3. **Skipping empty async_task blocks** - Partition regions that only contain skipped operations (like a single `tt.return`) are now omitted, eliminating empty `with tlx.async_task():` blocks.

4. **Refactored skip logic** - Replaced individual `if` statements with a `llvm::StringSet<>` lookup for cleaner, more maintainable code.

Authored with Claude.
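
The skip-set lookup and constant inlining described in items 1, 2, and 4 can be sketched like this. The real pass uses a C++ `llvm::StringSet<>`; this Python model with a plain `set` and invented helper names (`render_operand`, `emit_op`) is only an illustration.

```python
# Illustrative sketch (not the actual C++ pass) of the skip-set lookup
# and constant inlining at use sites.

SKIPPED_OPS = {
    "gpu.barrier",         # not needed in TLX
    "ttg.convert_layout",  # internal layout conversion
    "tt.return",           # terminator
    "tt.reduce.return",    # terminator
}

def render_operand(name, constants):
    """Inline a constant at its use site instead of referencing c32_i32-style names."""
    return str(constants[name]) if name in constants else name

def emit_op(op_name, result, operands, constants):
    """Return a TLX-style line for an op, or None if the op is skipped."""
    if op_name in SKIPPED_OPS:
        return None
    args = ", ".join(render_operand(o, constants) for o in operands)
    return f"{result} = {op_name.split('.')[-1]}({args})"
```

Under this model, `arith.muli(arg2, c32_i32)` with `c32_i32 = 32` recorded in the constant map emits `muli(arg2, 32)`, and a `gpu.barrier` emits nothing; a partition region whose ops all emit `None` is the "empty async_task block" case that item 3 drops.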

Test Plan:
1. Generated fwd.txt output using:
```
TRITON_DUMP_TTGIR_TO_TLX=1 python tritonbench/run.py --op blackwell_attentions ...
```

2. Verified:
- Constants are inlined (e.g., `mul(arg2, 32)` instead of `mul(arg2, c32_i32)`)
- No empty `with tlx.async_task():` blocks at end of output
- `gpu.barrier`, `ttg.convert_layout`, `tt.return` are not emitted

Tasks: T252981529

meta-codesync Bot commented Feb 28, 2026

@tissue3 has imported this pull request. If you are a Meta employee, you can view this in D94788464.


@manman-ren manman-ren left a comment


LGTM! Thanks!


meta-codesync Bot commented Mar 5, 2026

@tissue3 merged this pull request in e3fc12a.

htyu pushed a commit that referenced this pull request Mar 5, 2026
Summary:
Follow-up to #983. Further improves the `PrintTTGIRToTLX` pass to produce more Python-like output:
- **Comparison inlining**: `cmpi`/`cmpf` printed as `<`, `>=`, `==` etc. and inlined into `if`/`while` conditions (e.g., `if var_109 < var_105:`)
- **Cast transparency**: Type casts (`extui`, `trunci`, `truncf`, `sitofp`, `bitcast`, etc.) resolved through to source operand and skipped in output
- **Structured control flow**: `cf.br`/`cf.cond_br` converted to `if`/`else`, `while`, and `for var in tl.range(start, end[, step]):` loops
- **TMA op mappings**: Added `ttng.tensormap_create`, `ttg.global_scratch_alloc`, `ttng.tensormap_fenceproxy_acquire`
Authored with Claude.
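
The `cf.br`/`cf.cond_br` structuring can be sketched roughly as follows; the block and branch representations here (`loop_headers`, `structure_branch`, `emit_for`) are invented for illustration and are not the pass's actual data structures.

```python
# Hedged sketch of recovering structured control flow from cf.cond_br:
# a conditional branch whose target is a loop header becomes `while`,
# otherwise `if`; a counted induction-variable pattern becomes tl.range.

def structure_branch(cond, true_target, loop_headers):
    """Emit the Python header line for a cf.cond_br."""
    if true_target in loop_headers:
        return f"while {cond}:"
    return f"if {cond}:"

def emit_for(var, start, end, step=None):
    """Emit a counted loop recovered from an induction-variable pattern."""
    args = f"{start}, {end}" + (f", {step}" if step is not None else "")
    return f"for {var} in tl.range({args}):"
```

So a branch back to a loop header on `var_109 < var_105` prints `while var_109 < var_105:`, and a recognized counted loop prints `for var_7 in tl.range(0, var_5, 32):`, matching the `for var in tl.range(start, end[, step]):` shape described above.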

Pull Request resolved: #1005

Test Plan:
1. Build: `cd ~/triton; make`
2. Generate output with `TRITON_TLX_OUTPUT_FILE=~/tritonbench/output.py TRITON_TLX_COMPILABLE=1 TRITON_DUMP_TTGIR_TO_TLX=1 TRITON_ALWAYS_COMPILE=1 TRITON_KERNEL_DUMP=1 TRITON_DUMP_DIR=/tmp/triton_tissue030 TRITON_USE_META_WS=1 TRITON_PRINT_AUTOTUNING=1 CUDA_VISIBLE_DEVICES=3 bash ~/fbsource/fbcode/ads_mkl/benchmarks/denoise.sh python run.py --op blackwell_attentions --seq-len 8192 --batch 4 --n-heads 32 --d-head 128 --rep 3000 --sleep 1.0 --metrics tflops --simple-output --only triton_tutorial_flash_persistent_blackwell --force 2>&1 | tee fwd.txt`
3. Generated fwd.txt: [P2209959758](https://www.internalfb.com/phabricator/paste/view/P2209959758)

Reviewed By: prithvip0524

Differential Revision: D94788464

Pulled By: tissue3

fbshipit-source-id: 46efb5524daf465554918f430d46fd71c02e7f7d
meta-codesync Bot pushed a commit that referenced this pull request Mar 14, 2026
… pass

Summary:
The PrintTTGIRToTLX debug pass mapped `ttg.memdesc_index` to
`tlx.memdesc_index`, but the correct TLX Python API name is
`tlx.local_view`. This change aligns the emitted pseudocode with the
actual TLX DSL, making the output closer to compilable TLX Python.
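
The fix amounts to one entry in the pass's TTGIR-to-TLX op-name mapping; a minimal Python sketch of that table (the dict name and helper are illustrative, and only the `memdesc_index` entry is taken from this change):

```python
# Minimal sketch of the op-name mapping this fix adjusts.
TTGIR_TO_TLX = {
    # Was emitted as tlx.memdesc_index, which is not a real TLX API name.
    "ttg.memdesc_index": "tlx.local_view",
}

def tlx_name(op_name):
    """Map a TTGIR op name to its TLX Python spelling, defaulting to itself."""
    return TTGIR_TO_TLX.get(op_name, op_name)
```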

This is the rebased remainder of D94700558 after [PR #983](#983) and [PR #1005](#1005)
landed upstream. Those PRs already covered 6 of the 7 changes: math op
mappings, unsigned div/rem, NaN-propagating min/max, skipping
gpu.barrier/convert_layout, transparent convert_layout substitution via
getValueName, and the TMA descriptor. The seventh, constant deduplication,
is now obsolete since #983 inlines constants at use sites.

Rebased and authored with Claude.

Note that while the original intent of D94700558 was to ensure that generated
TLX from the TTGIR-to-TLX pass could compile, it did not achieve this yet. This
PR is to just get us up to speed with the earlier WIP, and then we'll post additional
fixes on top of this towards compilable TLX Python.

Differential Revision: D96554961
