Enable verbose asm #4528

ravil-mobile · 2024-08-16T16:29:32Z

This PR adds verbosity to assembly code after LLVM backend passes.

For the AMDGPU backend, the emitted assembly now keeps references to the source code. For example:

.LBB0_11:                               ;   in Loop: Header=BB0_13 Depth=1
        .loc    1 131 18                        ; gemm.py:131:18
        v_lshl_add_u64 v[6:7], v[14:15], 0, v[18:19]
        .loc    1 126 22                        ; gemm.py:126:22
        v_lshl_add_u64 v[2:3], v[16:17], 0, v[18:19]
        .loc    1 128 20                        ; gemm.py:128:20
        global_load_dwordx4 v[2:5], v[2:3], off
        s_nop 0
        global_load_dwordx4 v[10:13], v[6:7], off
        .loc    1 127 20                        ; gemm.py:127:20
        s_nop 0
        global_load_dwordx4 v[6:9], v[20:21], off offset:32
        v_mov_b64_e32 v[24:25], v[22:23]
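
One way to use these annotations is to group the emitted instructions by the source line that the preceding `.loc` comment points to. The following is a minimal sketch, not part of this PR: `group_by_source_line` and `LOC_RE` are hypothetical names, and the parsing assumes the comment layout shown in the dump above (`;` as the comment marker, with `//` accepted as well).

```python
import re
from collections import defaultdict

# Matches e.g. ".loc    1 131 18    ; gemm.py:131:18".
# Assumption: the trailing comment has the form "<file>:<line>:<col>".
LOC_RE = re.compile(r"^\s*\.loc\s+\d+\s+\d+\s+\d+\s*(?:;|//)\s*(\S+):(\d+):(\d+)")


def group_by_source_line(asm_text):
    """Map (file, line) -> instructions that follow a matching .loc."""
    groups = defaultdict(list)
    current = None
    for raw in asm_text.splitlines():
        m = LOC_RE.match(raw)
        if m:
            current = (m.group(1), int(m.group(2)))
            continue
        # Drop trailing comments, then skip blanks, labels and directives.
        text = re.split(r";|//", raw, maxsplit=1)[0].strip()
        if not text or text.endswith(":") or text.startswith("."):
            continue
        if current is not None:
            groups[current].append(text)
    return dict(groups)


sample = """\
.LBB0_11:                               ;   in Loop: Header=BB0_13 Depth=1
        .loc    1 131 18                        ; gemm.py:131:18
        v_lshl_add_u64 v[6:7], v[14:15], 0, v[18:19]
        .loc    1 126 22                        ; gemm.py:126:22
        v_lshl_add_u64 v[2:3], v[16:17], 0, v[18:19]
        .loc    1 128 20                        ; gemm.py:128:20
        global_load_dwordx4 v[2:5], v[2:3], off
        s_nop 0
        global_load_dwordx4 v[10:13], v[6:7], off
"""

groups = group_by_source_line(sample)
for (filename, line), instructions in sorted(groups.items()):
    print(f"{filename}:{line}")
    for inst in instructions:
        print(f"    {inst}")
```

Instructions are attributed to the most recent `.loc`, so anything emitted before the first `.loc` is ignored by this sketch.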

Additionally, Kernel Info is now added at the end of the file. For example:

; Kernel info:
; codeLenInByte = 7732
; NumSgprs: 24
; NumVgprs: 154
; NumAgprs: 128
; TotalNumVgprs: 284
; ScratchSize: 0
; MemoryBound: 1
; FloatMode: 240
; IeeeMode: 1
; LDSByteSize: 0 bytes/workgroup (compile time only)
; SGPRBlocks: 2
; VGPRBlocks: 35
; NumSGPRsForWavesPerEU: 24
; NumVGPRsForWavesPerEU: 284
; AccumOffset: 156
; Occupancy: 1
; WaveLimiterHint : 0
; COMPUTE_PGM_RSRC2:SCRATCH_EN: 0
; COMPUTE_PGM_RSRC2:USER_SGPR: 2
; COMPUTE_PGM_RSRC2:TRAP_HANDLER: 0
; COMPUTE_PGM_RSRC2:TGID_X_EN: 1
; COMPUTE_PGM_RSRC2:TGID_Y_EN: 0
; COMPUTE_PGM_RSRC2:TGID_Z_EN: 0
; COMPUTE_PGM_RSRC2:TIDIG_COMP_CNT: 0
; COMPUTE_PGM_RSRC3_GFX90A:ACCUM_OFFSET: 38
; COMPUTE_PGM_RSRC3_GFX90A:TG_SPLIT: 0
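
Since the Kernel Info block is plain `; key: value` comments, it can be read back from the dumped assembly with a few lines of Python. This is a rough sketch under the assumption that the block looks like the example above; `parse_kernel_info` and `KV_RE` are hypothetical names and not part of this PR or of Triton.

```python
import re

# Assumption: each entry is "; <key>: <int>" or "; <key> = <int>",
# optionally followed by free text (e.g. "bytes/workgroup ...").
KV_RE = re.compile(r"^;\s*(.+?)\s*[:=]\s*(\d+)\b")


def parse_kernel_info(asm_text):
    """Return the integer fields of the '; Kernel info:' block as a dict."""
    info = {}
    in_block = False
    for line in asm_text.splitlines():
        line = line.strip()
        if line.startswith("; Kernel info:"):
            in_block = True
            continue
        if not in_block:
            continue
        if not line.startswith(";"):
            break  # first non-comment line ends the block
        m = KV_RE.match(line)
        if m:
            info[m.group(1)] = int(m.group(2))
    return info


sample = """\
        s_endpgm
; Kernel info:
; codeLenInByte = 7732
; NumSgprs: 24
; NumVgprs: 154
; NumAgprs: 128
; TotalNumVgprs: 284
; LDSByteSize: 0 bytes/workgroup (compile time only)
; Occupancy: 1
; WaveLimiterHint : 0
; COMPUTE_PGM_RSRC2:USER_SGPR: 2
"""

info = parse_kernel_info(sample)
print(info["NumVgprs"], info["NumAgprs"], info["Occupancy"])
```

Keys that themselves contain a colon, such as `COMPUTE_PGM_RSRC2:USER_SGPR`, are kept whole because the pattern only splits at a separator that is directly followed by a number.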

zhanglx13 · 2024-08-20T22:06:56Z

@ravil-mobile Thank you for enabling this.
Can you do a quick check on the NV side to make sure this does not break anything when printing the sass?

ravil-mobile · 2024-08-22T13:09:36Z

> @ravil-mobile Thank you for enabling this. Can you do a quick check on the NV side to make sure this does not break anything when printing the sass?

Hi @zhanglx13. Checked. Everything works for the NV backend.

ThomasRaoux · 2024-08-22T14:26:28Z

how does this impact the ptx and sass dumped on the nv path?

ravil-mobile · 2024-08-27T09:45:49Z

> how does this impact the ptx and sass dumped on the nv path?

Hi @ThomasRaoux

Regarding ptx, it just adds comments about the source code location. For example:

        .reg .pred      %p<13>;
        .reg .b32       %r<54>;
        .reg .f32       %f<32>;
        .reg .b64       %rd<10>;
        .loc    1 15 0                          // softmax.py:15:0
$L__func_begin0:
        .loc    1 15 0                          // softmax.py:15:0

// %bb.0:
        ld.param.u64    %rd3, [softmax_kernel_param_0];
        ld.param.u32    %r20, [softmax_kernel_param_1];
$L__tmp0:
        .loc    1 16 22                         // softmax.py:16:22
        // begin inline asm
        mov.u32 %r1, %ctaid.x;
        // end inline asm
        .loc    1 17 30                         // softmax.py:17:30
        mul.lo.s32      %r21, %r1, %r20;
        ld.param.u64    %rd4, [softmax_kernel_param_2];
        ld.param.u32    %r22, [softmax_kernel_param_3];
        .loc    1 17 24                         // softmax.py:17:24
        mul.wide.s32    %rd5, %r21, 4;
        add.s64         %rd6, %rd3, %rd5;
        ld.param.u32    %r23, [softmax_kernel_param_4];
        .loc    1 18 27                         // softmax.py:18:27
        mov.u32         %r24, %tid.x;
        and.b32         %r25, %r24, 31;
        and.b32         %r26, %r24, 63;
        .loc    1 19 58                         // softmax.py:19:58

Regarding cubin, the comments are simply discarded, so this is not going to affect performance.

I can also change the code to enable verbose output only for the AMD backend.

antiagainst

LG to have this uniformly.

This PR adds verbosity to assembly code after LLVM backend passes. This adds references to the source code for both NV and AMD. Additionally, it adds `Kernel Info` at the end of the dump for AMD. For example:

```
; Kernel info:
; codeLenInByte = 7732
; NumSgprs: 24
; NumVgprs: 154
; NumAgprs: 128
; TotalNumVgprs: 284
...
```

ravil-mobile requested a review from ptillet as a code owner August 16, 2024 16:29

antiagainst marked this pull request as draft August 16, 2024 16:36

zhanglx13 approved these changes Aug 20, 2024

ravil-mobile force-pushed the ravil/verbose-asm branch from aa27c02 to 76b664b Compare August 22, 2024 13:08

zhanglx13 marked this pull request as ready for review August 22, 2024 13:16

Enable verbose asm

5889d97

ravil-mobile force-pushed the ravil/verbose-asm branch from 76b664b to 5889d97 Compare August 27, 2024 09:50

antiagainst approved these changes Aug 27, 2024

antiagainst merged commit 1827757 into triton-lang:main Aug 27, 2024

jlebar mentioned this pull request Sep 3, 2024

Build LLVMAarch64CodeGen if CMAKE_OSX_ARCHITECTURES is arm64. #4637

Merged
