Enable remarks for ttgir lowering with SourceMgrDiagnosticHandler #3835
manman-ren wants to merge 2 commits into triton-lang:main
Conversation
I am thinking about using remarks to capture key optimization information, which could then be integrated into a performance analysis tool (Proton may be a good choice). CC @Jokeren
Remarks can show the location in both the .py source and the .ttgir file. I haven't figured out how to filter diagnostics for SourceMgrDiagnosticHandler, so currently we also get some warnings:
This is interesting. I need to think more about it, though. Could we keep it as a draft PR and come back to discuss it later? I promise that I won't forget :)
Saved it to my GitHub folder.
Hi @manman-ren, I had a discussion with @htyu and understand the purpose of the PR. I'm very supportive of providing more metadata. I have two suggestions.
I may spend some time working on this soon :) Happy to chat if you have any thoughts.
Thanks for taking the time!
Using a separate pass to get the information is better than instrumenting existing compiler passes. But still, I'm worried about parsing the remarks. If instead we parse the GPU IR by calling
The design would be
Yeah, I added a separate pass, but some decisions are made during the lowering phase, such as vectorization and whether a load instruction will be coalesced. For those, I am hoping to use optimization remarks.
// Check to see if this op is coalesced. Depending on vectorization, we
// can have a varying number of instructions to perform the load. Each
// instruction handles vec elements. Each thread handles numElems, and
// the next thread handles vec elements starting at sizePerThread.
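As a rough illustration of the arithmetic this comment describes (the names vec and numElems come from the comment; the helper function itself is hypothetical, not Triton code):

```python
def load_instruction_count(num_elems: int, vec: int) -> int:
    """Each load instruction moves `vec` contiguous elements, so a thread
    that owns `num_elems` elements needs num_elems / vec instructions."""
    assert num_elems % vec == 0, "elements per thread must be a multiple of vec"
    return num_elems // vec

# For the blocked layout shown later in this thread (sizePerThread = [4],
# i.e. 4 elements per thread) with vec = 4, each thread issues a single
# vectorized load instruction:
print(load_instruction_count(num_elems=4, vec=4))  # 1
```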
Discussions are now in the new PR: #3922
This will print remarks like the ones below:
01-vector-add.py:48:16: remark: vec = 4
x = tl.load(x_ptr + offsets, mask=mask)
^
With printOpOnDiagnostic set to true, it will also show the current operation:
01-vector-add.py:49:16: remark: vec = 4
y = tl.load(y_ptr + offsets, mask=mask)
^
01-vector-add.py:49:16: note: see current operation: %332 = "tt.load"(%315, %165) <{cache = 1 : i32, evict = 1 : i32, isVolatile = false, operandSegmentSizes = array<i32: 1, 1, 0>}> : (tensor<1024x!tt.ptr<f32, 1>, #triton_gpu.blocked<{sizePerThread = [4], threadsPerWarp = [32], warpsPerCTA = [4], order = [0]}>>, tensor<1024xi1, #triton_gpu.blocked<{sizePerThread = [4], threadsPerWarp = [32], warpsPerCTA = [4], order = [0]}>>) -> tensor<1024xf32, #triton_gpu.blocked<{sizePerThread = [4], threadsPerWarp = [32], warpsPerCTA = [4], order = [0]}>>
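To feed these remarks into a performance tool such as Proton, one option is to parse the handler's text output. A minimal sketch, assuming only the `file:line:col: remark: <message>` format visible in the examples above (the parser and its names are hypothetical, not part of Triton or MLIR):

```python
import re

# Matches lines like "01-vector-add.py:48:16: remark: vec = 4";
# warnings and notes use a different keyword and are skipped.
REMARK_RE = re.compile(r"^(?P<file>[^:]+):(?P<line>\d+):(?P<col>\d+): remark: (?P<msg>.*)$")

def parse_remarks(text: str):
    """Collect (file, line, col, message) tuples from diagnostic output."""
    records = []
    for raw in text.splitlines():
        m = REMARK_RE.match(raw.strip())
        if m:
            records.append((m["file"], int(m["line"]), int(m["col"]), m["msg"]))
    return records

output = """\
01-vector-add.py:48:16: remark: vec = 4
    x = tl.load(x_ptr + offsets, mask=mask)
01-vector-add.py:49:16: note: see current operation: ...
"""
print(parse_remarks(output))  # [('01-vector-add.py', 48, 16, 'vec = 4')]
```

Filtering at this level sidesteps the open question of filtering inside SourceMgrDiagnosticHandler, at the cost of depending on the diagnostic text format.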