
Enable remarks for ttgir lowering with SourceMgrDiagnosticHandler#3835

Closed
manman-ren wants to merge 2 commits into triton-lang:main from manman-ren:enable-remark

Conversation

manman-ren (Collaborator) commented May 3, 2024

This will print remarks like the ones below:
01-vector-add.py:48:16: remark: vec = 4
x = tl.load(x_ptr + offsets, mask=mask)
^
With printOpOnDiagnostic set to true, it will show:
01-vector-add.py:49:16: remark: vec = 4
y = tl.load(y_ptr + offsets, mask=mask)
^
01-vector-add.py:49:16: note: see current operation: %332 = "tt.load"(%315, %165) <{cache = 1 : i32, evict = 1 : i32, isVolatile = false, operandSegmentSizes = array<i32: 1, 1, 0>}> : (tensor<1024x!tt.ptr<f32, 1>, #triton_gpu.blocked<{sizePerThread = [4], threadsPerWarp = [32], warpsPerCTA = [4], order = [0]}>>, tensor<1024xi1, #triton_gpu.blocked<{sizePerThread = [4], threadsPerWarp = [32], warpsPerCTA = [4], order = [0]}>>) -> tensor<1024xf32, #triton_gpu.blocked<{sizePerThread = [4], threadsPerWarp = [32], warpsPerCTA = [4], order = [0]}>>
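A tool that consumes these diagnostics could parse remark lines of this shape. Below is a minimal Python sketch; the `file:line:col: remark: message` format is assumed from the sample output above, and the function name is hypothetical:

```python
import re

# Matches lines like "01-vector-add.py:48:16: remark: vec = 4".
# The format is assumed from the sample diagnostics above.
REMARK_RE = re.compile(
    r"^(?P<file>[^:]+):(?P<line>\d+):(?P<col>\d+): remark: (?P<msg>.*)$"
)

def parse_remark(line):
    """Return (file, line, col, message) for a remark line, else None."""
    m = REMARK_RE.match(line)
    if m is None:
        return None
    return (m.group("file"), int(m.group("line")),
            int(m.group("col")), m.group("msg"))
```

Note that non-remark diagnostics (e.g. warnings) simply fail the match and are skipped.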

@manman-ren manman-ren requested a review from ptillet as a code owner May 3, 2024 17:11
@manman-ren manman-ren marked this pull request as draft May 3, 2024 17:11
manman-ren (Collaborator, Author) commented

I am thinking about using remarks to capture key optimization information, which could then be integrated into a performance analysis tool (Proton may be a good choice). CC @Jokeren
I am interested in the following information, though I am not sure which pieces should come from the compiler and which from Proton:

  • number of layout conversions, and the memory transfer size of each conversion
  • number of loads/stores; for each load or store, is it coalesced? Is it vectorized?
  • for each loop, is software pipelining (SWP) effectively enabled?
  • which mma instruction is used for tl.dot? Would transposing make tl.dot eligible for mma v3?
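As a rough illustration of the first two bullets, some of these counts could even be scraped from a textual ttgir dump. This is a toy sketch assuming the op names visible in the dump above; a real implementation would walk the parsed MLIR module instead:

```python
import re

def count_ops(ttgir_text):
    """Toy sketch: count load/store and layout-conversion ops in a ttgir dump.

    Scans op names textually; a real analysis would traverse the parsed
    MLIR module rather than matching strings.
    """
    counts = {}
    for op in ("tt.load", "tt.store", "triton_gpu.convert_layout"):
        counts[op] = len(re.findall(re.escape(op), ttgir_text))
    return counts
```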

Remarks can show locations in both the .py source and the .ttgir file.

I haven't figured out how to filter diagnostics for SourceMgrDiagnosticHandler. Currently we also get warnings such as:
01-vector-add.py:28:0: warning: Unhandled parameter attribute 'tt.divisibility'
If printOpOnDiagnostic is on, the handler will dump the whole function.

@manman-ren manman-ren requested a review from htyu May 3, 2024 17:21
Jokeren (Contributor) commented May 3, 2024

This is interesting. I need to think more about it though.

Could we keep it as a draft PR and come back to discuss it later? I promise I won't forget :)

Jokeren (Contributor) commented May 3, 2024

Saved it to my GitHub folder.

Jokeren (Contributor) commented May 7, 2024

Hi @manman-ren , I had a discussion with @htyu and understand the purpose of the PR. I'm very supportive of providing more metadata.

I have two suggestions.

  1. We could provide the kernel source file path to Proton and let it analyze the Triton GPU IR using existing passes to recover information.
  2. Once we have all the information, we could first reuse our existing viewer to display at most two metrics on the terminal. Next, maybe it's time to start integrating with some existing visualizers, because we need to view multiple metrics and source lines.

I may spend some time working on this soon :) Happy to chat if you have any thoughts.

manman-ren (Collaborator, Author) commented

> Hi @manman-ren , I had a discussion with @htyu and understand the purpose of the PR. I'm very supportive of providing more metadata.
>
> I have two suggestions.
>
>   1. We could provide the kernel source file path to Proton and let it analyze the Triton GPU IR using existing passes to recover information.
>   2. Once we have all the information, we could first reuse our existing viewer to display at most two metrics on the terminal. Next, maybe it's time to start integrating with some existing visualizers, because we need to view multiple metrics and source lines.
>
> I may spend some time working on this soon :) Happy to chat if you have any thoughts.

Thanks for taking the time!
For 1: My current plan is to emit optimization remarks (each comes with a source line and a ttgir line). Maybe Proton can parse the remarks. I am not sure how easy it would be to let Proton run passes on GPU IR. I have a follow-up patch that adds a pass to collect information and emit optimization remarks while lowering ttgir to LLVM.
For 2: It would be great to have a visualizer that shows profiling data together with analysis data; we could then have one tool for AMD/NV and also for PT2/Triton.

Jokeren (Contributor) commented May 7, 2024

> For 1: My current plan is to emit optimization remarks (each comes with a source line and a ttgir line). Maybe Proton can parse the remarks. I am not sure how easy it would be to let Proton run passes on GPU IR. I have a follow-up patch that adds a pass to collect information and emit optimization remarks while lowering ttgir to LLVM.

Using a separate pass to get the information is better than instrumenting existing compiler passes.

But still, I'm worried about parsing the remarks.

If instead we parse the GPU IR by calling module = ir.parse_mlir_module(full_name, context) and use an analysis function to analyze the IR, it offers the following benefits:

  1. This function can utilize all existing analysis modules.
  2. Proton can get the parsed information in memory without parsing the file.
  3. The analysis function can be called as a Python interface of Triton and benefit other profilers or analyzers.

The design would be something like:

class result:
    file: str
    line_no: int
    column_no: int
    remarks: dict[str, problem]

class func_result:
    results: list[result]

class module_result:
    func_results: dict[str, func_result]

module = ir.parse_mlir_module(<gpu ir>)
analysis_results = libtriton.analyze_module(module)
libproton.inspect(analysis_results)
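For illustration, the pseudocode above could be fleshed out as runnable Python dataclasses. This is a sketch only: the `problem` payload type in the original design is unspecified, so a plain dict stands in for it here:

```python
from dataclasses import dataclass, field

@dataclass
class Result:
    file: str
    line_no: int
    column_no: int
    # Remark name -> payload; the original design's `problem` type is
    # unspecified, so a plain dict stands in for it here.
    remarks: dict = field(default_factory=dict)

@dataclass
class FuncResult:
    results: list = field(default_factory=list)

@dataclass
class ModuleResult:
    # Function name -> per-function results.
    func_results: dict = field(default_factory=dict)
```

A consumer such as a profiler could then navigate from module to function to source location without re-parsing any text.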

manman-ren (Collaborator, Author) commented

Yeah, I added a separate pass, but some decisions are made during the lowering phase, such as vectorization and whether a load instruction will be coalesced. For those, I am hoping to use optimization remarks.
#3853 adds a perf-collection pass. It is in draft mode, mostly for discussion purposes.

// Check to see if this op is coalesced. Depending on vectorization, we
// can have a varying number of instructions to perform the load. Each
// instruction handles vec elements; each thread handles numElems, and
// the next thread handles vec elements starting at sizePerThread.
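The arithmetic described in this comment can be illustrated with a small Python sketch. The helper is hypothetical (not code from the PR); its names mirror the comment above:

```python
def loads_per_thread(num_elems, vec):
    """Hypothetical helper mirroring the comment above, not code from the PR.

    Each thread handles num_elems elements, and each load instruction
    handles vec contiguous elements, so a thread issues num_elems / vec
    load instructions."""
    assert num_elems % vec == 0, "vectorization width must divide elements per thread"
    return num_elems // vec
```

With the blocked layout from the earlier dump (sizePerThread = [4]) and vec = 4, each thread would need a single load instruction.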
manman-ren (Collaborator, Author) commented:

@Jokeren @htyu Is it possible to get this information at ttgir level?

A contributor replied:

Yes

@manman-ren
Copy link
Copy Markdown
Collaborator Author

Discussions have moved to the new PR: #3922. Closing this.

@manman-ren manman-ren closed this May 21, 2024