
[TLX] Improve PrintTTGIRToTLX output readability #983

Closed
tissue3 wants to merge 3 commits into facebookexperimental:main from tissue3:tlx_print_inline_constant

Conversation

tissue3 (Contributor) commented Feb 26, 2026


Summary:
This change improves the TLX emission pass to produce cleaner, more readable output by:

1. **Inlining constants at use sites** - Instead of emitting `c32_i32 = 32` and referencing `c32_i32`, constants are now inlined directly as `32` where used. This significantly reduces noise from constant definitions.

2. **Skipping non-meaningful operations** - Operations that don't contribute to TLX understanding are now filtered out:
   - `gpu.barrier` - not needed in TLX
   - `ttg.convert_layout` - internal layout conversion
   - `tt.return` / `tt.reduce.return` - terminators
   - Various warp specialization internals already skipped

3. **Skipping empty async_task blocks** - Partition regions that only contain skipped operations (like a single `tt.return`) are now omitted, eliminating empty `with tlx.async_task():` blocks.

4. **Refactored skip logic** - Replaced individual `if` statements with a `llvm::StringSet<>` lookup for cleaner, more maintainable code.
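
The skip logic described in items 2-4 can be pictured as a minimal, table-driven sketch. The real pass uses `llvm::StringSet<>`; `std::unordered_set` stands in here so the snippet builds without LLVM headers, and the helper names `isSkippedOp` / `regionIsTrivial` are hypothetical, not the pass's actual API.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Table-driven skip check. The real pass uses llvm::StringSet<>;
// std::unordered_set stands in here so the sketch builds without LLVM.
static bool isSkippedOp(const std::string &opName) {
  static const std::unordered_set<std::string> kSkipped = {
      "gpu.barrier",        // not needed in TLX
      "ttg.convert_layout", // internal layout conversion
      "tt.return",          // terminator
      "tt.reduce.return",   // terminator
  };
  return kSkipped.count(opName) != 0;
}

// A partition region whose ops are all skipped (e.g. a lone tt.return)
// produces no output, so the enclosing `with tlx.async_task():` block
// can be omitted entirely.
static bool regionIsTrivial(const std::vector<std::string> &opNames) {
  for (const auto &name : opNames)
    if (!isSkippedOp(name))
      return false;
  return true;
}
```

A set lookup keeps the filter in one place, so adding or removing a skipped op is a one-line change instead of another `if` branch.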

Authored with Claude.

Test Plan:
1. Generated fwd.txt output using:
```
WITH_OSS_WARPSPEC=1 TRITON_USE_META_PARTITION=1 TRITON_TLX_COMPILABLE=1 TRITON_DUMP_TTGIR_TO_TLX=1 TRITON_ALWAYS_COMPILE=1 TRITON_KERNEL_DUMP=1 TRITON_DUMP_DIR=/tmp/triton_tissue030 TRITON_USE_META_WS=1 TRITON_PRINT_AUTOTUNING=1 CUDA_VISIBLE_DEVICES=3 bash ~/fbsource/fbcode/ads_mkl/benchmarks/denoise.sh python run.py --op blackwell_attentions --seq-len 8192 --batch 4 --n-heads 32 --d-head 128 --rep 3000 --sleep 1.0 --metrics tflops --simple-output --only triton_tutorial_flash_persistent_blackwell --force
```

2. Verified:
- Constants are inlined (e.g., `mul(arg2, 32)` instead of `mul(arg2, c32_i32)`)
- No empty `with tlx.async_task():` blocks at end of output
- `gpu.barrier`, `ttg.convert_layout`, `tt.return` are not emitted

Tasks: T252981529
meta-cla Bot added the CLA Signed label Feb 26, 2026
meta-codesync Bot (Contributor) commented Feb 26, 2026

@tissue3 has imported this pull request. If you are a Meta employee, you can view this in D94436902.

manman-ren (Contributor) left a comment

LGTM! Thanks!

meta-codesync Bot (Contributor) commented Feb 28, 2026

@tissue3 merged this pull request in 4867b62.

htyu pushed a commit that referenced this pull request Mar 3, 2026
Pull Request resolved: #983

Test Plan:
1. Generated fwd.txt output using:
```
TRITON_TLX_OUTPUT_FILE=~/tritonbench/output.py TRITON_TLX_COMPILABLE=1 TRITON_DUMP_TTGIR_TO_TLX=1 TRITON_ALWAYS_COMPILE=1 TRITON_KERNEL_DUMP=1 TRITON_DUMP_DIR=/tmp/triton_tissue030 TRITON_USE_META_WS=1 TRITON_PRINT_AUTOTUNING=1 CUDA_VISIBLE_DEVICES=3 bash ~/fbsource/fbcode/ads_mkl/benchmarks/denoise.sh python run.py --op blackwell_attentions --seq-len 8192 --batch 4 --n-heads 32 --d-head 128 --rep 3000 --sleep 1.0 --metrics tflops --simple-output --only triton_tutorial_flash_persistent_blackwell --force
```

2. Verified:
- Constants are inlined (e.g., `mul(arg2, 32)` instead of `mul(arg2, c32_i32)`)
- No empty `with tlx.async_task():` blocks at end of output
- `gpu.barrier`, `ttg.convert_layout`, `tt.return` are not emitted
Before the change [P2205734619](https://www.internalfb.com/phabricator/paste/view/P2205734619)
After the change [P2208628663](https://www.internalfb.com/phabricator/paste/view/P2208628663)

Reviewed By: jma2333

Differential Revision: D94436902

Pulled By: tissue3

fbshipit-source-id: ddaf3e9d939b25573b2d3cac400bccae3516df44
meta-codesync Bot pushed a commit that referenced this pull request Mar 5, 2026
Summary:
Follow-up to #983. Further improves the `PrintTTGIRToTLX` pass to produce more Python-like output:
- **Comparison inlining**: `cmpi`/`cmpf` printed as `<`, `>=`, `==` etc. and inlined into `if`/`while` conditions (e.g., `if var_109 < var_105:`)
- **Cast transparency**: Type casts (`extui`, `trunci`, `truncf`, `sitofp`, `bitcast`, etc.) resolved through to source operand and skipped in output
- **Structured control flow**: `cf.br`/`cf.cond_br` converted to `if`/`else`, `while`, and `for var in tl.range(start, end[, step]):` loops
- **TMA op mappings**: Added `ttng.tensormap_create`, `ttg.global_scratch_alloc`, `ttng.tensormap_fenceproxy_acquire`
Authored with Claude.
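
As a rough illustration of the comparison inlining and structured-control-flow rendering above, here is a minimal sketch. Predicate names follow the MLIR `arith` dialect; the function names and exact table contents are assumptions, not the pass's real implementation.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <unordered_map>

// Map an arith.cmpi predicate to the Python operator used when the
// comparison is inlined into an `if`/`while` condition.
static std::string predicateToPython(const std::string &pred) {
  static const std::unordered_map<std::string, std::string> kOps = {
      {"eq", "=="}, {"ne", "!="},
      {"slt", "<"}, {"sle", "<="},
      {"sgt", ">"}, {"sge", ">="},
  };
  auto it = kOps.find(pred);
  return it == kOps.end() ? pred : it->second;
}

// Render a reconstructed loop header once a cf.br/cf.cond_br cycle has
// been recognized as a counted loop; the step is elided when it is 1.
static std::string renderRange(const std::string &var, int start, int end,
                               int step) {
  std::ostringstream os;
  os << "for " << var << " in tl.range(" << start << ", " << end;
  if (step != 1)
    os << ", " << step;
  os << "):";
  return os.str();
}
```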

Pull Request resolved: #1005

Test Plan:
1. Build: `cd ~/triton; make`
2. Generate output with `TRITON_TLX_OUTPUT_FILE=~/tritonbench/output.py TRITON_TLX_COMPILABLE=1 TRITON_DUMP_TTGIR_TO_TLX=1 TRITON_ALWAYS_COMPILE=1 TRITON_KERNEL_DUMP=1 TRITON_DUMP_DIR=/tmp/triton_tissue030 TRITON_USE_META_WS=1 TRITON_PRINT_AUTOTUNING=1 CUDA_VISIBLE_DEVICES=3 bash ~/fbsource/fbcode/ads_mkl/benchmarks/denoise.sh python run.py --op blackwell_attentions --seq-len 8192 --batch 4 --n-heads 32 --d-head 128 --rep 3000 --sleep 1.0 --metrics tflops --simple-output --only triton_tutorial_flash_persistent_blackwell --force 2>&1 | tee fwd.txt`
3. Generated fwd.txt: [P2209959758](https://www.internalfb.com/phabricator/paste/view/P2209959758)

Reviewed By: prithvip0524

Differential Revision: D94788464

Pulled By: tissue3

fbshipit-source-id: 46efb5524daf465554918f430d46fd71c02e7f7d
htyu pushed a commit that referenced this pull request Mar 5, 2026
meta-codesync Bot pushed a commit that referenced this pull request Mar 14, 2026
… pass

Summary:
The PrintTTGIRToTLX debug pass mapped `ttg.memdesc_index` to
`tlx.memdesc_index`, but the correct TLX Python API name is
`tlx.local_view`. This change aligns the emitted pseudocode with the
actual TLX DSL, making the output closer to compilable TLX Python.
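
The rename itself can be pictured as a one-entry lookup table. Only the `ttg.memdesc_index` to `tlx.local_view` pair comes from this change; the helper name and map structure here are illustrative.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Rename table applied when emitting TLX pseudocode. Only the
// memdesc_index -> local_view entry is from this change; ops without
// an entry keep their TTGIR name.
static std::string tlxOpName(const std::string &ttgirOp) {
  static const std::unordered_map<std::string, std::string> kRenames = {
      {"ttg.memdesc_index", "tlx.local_view"},
  };
  auto it = kRenames.find(ttgirOp);
  return it == kRenames.end() ? ttgirOp : it->second;
}
```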

This is the rebased remainder of D94700558 after [PR #983](#983) and [PR #1005](#1005)
landed upstream. Those PRs already covered 6 of 7 changes: math op
mappings, unsigned div/rem, NaN-propagating min/max, skipping
gpu.barrier/convert_layout, transparent convert_layout substitution via
getValueName, and the TMA descriptor; constant deduplication is obsolete
now that #983 inlines constants at use sites.

Rebased and authored with Claude.

Note that while the original intent of D94700558 was to ensure that generated
TLX from the TTGIR-to-TLX pass could compile, it did not achieve this yet. This
PR is to just get us up to speed with the earlier WIP, and then we'll post additional
fixes on top of this towards compilable TLX Python.

Differential Revision: D96554961