@ravil-mobile
Contributor

@ravil-mobile ravil-mobile commented Aug 16, 2024

This PR enables verbose assembly output after the LLVM backend passes.

For the AMDGPU backend, the emitted assembly now keeps references to the source code. For example:

.LBB0_11:                               ;   in Loop: Header=BB0_13 Depth=1
        .loc    1 131 18                        ; gemm.py:131:18
        v_lshl_add_u64 v[6:7], v[14:15], 0, v[18:19]
        .loc    1 126 22                        ; gemm.py:126:22
        v_lshl_add_u64 v[2:3], v[16:17], 0, v[18:19]
        .loc    1 128 20                        ; gemm.py:128:20
        global_load_dwordx4 v[2:5], v[2:3], off
        s_nop 0
        global_load_dwordx4 v[10:13], v[6:7], off
        .loc    1 127 20                        ; gemm.py:127:20
        s_nop 0
        global_load_dwordx4 v[6:9], v[20:21], off offset:32
        v_mov_b64_e32 v[24:25], v[22:23]
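
As a rough illustration of how these `.loc` annotations could be consumed, the sketch below groups instruction mnemonics by the source line and column that precede them. It is not part of this PR: the function name `group_by_source_line` is hypothetical, and the parsing is deliberately simplified (it assumes one instruction per line and treats `;` and `//` as comment starters).

```python
import re
from collections import defaultdict

# ".loc <file> <line> <col>" as in the dump above; any trailing comment is ignored.
LOC_RE = re.compile(r"^\s*\.loc\s+\d+\s+(\d+)\s+(\d+)")


def group_by_source_line(asm_text):
    """Return {(line, col): [mnemonic, ...]} for instructions following each .loc."""
    groups = defaultdict(list)
    current = None
    for raw in asm_text.splitlines():
        m = LOC_RE.match(raw)
        if m:
            current = (int(m.group(1)), int(m.group(2)))
            continue
        # Drop comments (";" in AMDGPU assembly, "//" in PTX) and whitespace.
        code = re.split(r";|//", raw, maxsplit=1)[0].strip()
        # Skip blank lines, labels ("name:") and other directives (".reg", ...).
        if not code or code.endswith(":") or code.startswith("."):
            continue
        if current is not None:
            groups[current].append(code.split()[0])
    return dict(groups)


sample = """\
.LBB0_11:                               ;   in Loop: Header=BB0_13 Depth=1
        .loc    1 131 18                        ; gemm.py:131:18
        v_lshl_add_u64 v[6:7], v[14:15], 0, v[18:19]
        .loc    1 128 20                        ; gemm.py:128:20
        global_load_dwordx4 v[2:5], v[2:3], off
        s_nop 0
        global_load_dwordx4 v[10:13], v[6:7], off
"""

for loc, ops in group_by_source_line(sample).items():
    print(loc, ops)
```

A grouping like this makes it easier to see which source line a cluster of loads or `s_nop` instructions was attributed to.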

Additionally, a `Kernel info` block is appended at the end of the file. For example:

; Kernel info:
; codeLenInByte = 7732
; NumSgprs: 24
; NumVgprs: 154
; NumAgprs: 128
; TotalNumVgprs: 284
; ScratchSize: 0
; MemoryBound: 1
; FloatMode: 240
; IeeeMode: 1
; LDSByteSize: 0 bytes/workgroup (compile time only)
; SGPRBlocks: 2
; VGPRBlocks: 35
; NumSGPRsForWavesPerEU: 24
; NumVGPRsForWavesPerEU: 284
; AccumOffset: 156
; Occupancy: 1
; WaveLimiterHint : 0
; COMPUTE_PGM_RSRC2:SCRATCH_EN: 0
; COMPUTE_PGM_RSRC2:USER_SGPR: 2
; COMPUTE_PGM_RSRC2:TRAP_HANDLER: 0
; COMPUTE_PGM_RSRC2:TGID_X_EN: 1
; COMPUTE_PGM_RSRC2:TGID_Y_EN: 0
; COMPUTE_PGM_RSRC2:TGID_Z_EN: 0
; COMPUTE_PGM_RSRC2:TIDIG_COMP_CNT: 0
; COMPUTE_PGM_RSRC3_GFX90A:ACCUM_OFFSET: 38
; COMPUTE_PGM_RSRC3_GFX90A:TG_SPLIT: 0
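
Since the block is plain `; key: value` comments, it can be scraped with a few lines of Python. The following is a minimal sketch under stated assumptions, not something this PR ships: `parse_kernel_info` is a hypothetical helper, it reads only the first `Kernel info` block in the text, and it keeps just the leading integer of values such as `0 bytes/workgroup (compile time only)`.

```python
import re

# Split on the LAST ":" or "=" so keys such as
# "COMPUTE_PGM_RSRC2:SCRATCH_EN" keep their inner colon.
LINE_RE = re.compile(r"^;\s*(.+)[:=]\s*(.*)$")


def parse_kernel_info(asm_text):
    """Return the first '; Kernel info:' block as a dict of key -> int or str."""
    info = {}
    in_block = False
    for raw in asm_text.splitlines():
        line = raw.strip()
        if line.startswith("; Kernel info:"):
            in_block = True
            continue
        if not in_block:
            continue
        m = LINE_RE.match(line)
        if not m:
            break  # the first line that is not "; key: value" ends the block
        key, value = m.group(1).strip(), m.group(2).strip()
        num = re.match(r"-?\d+", value)
        info[key] = int(num.group()) if num else value
    return info


sample = """\
; Kernel info:
; codeLenInByte = 7732
; NumVgprs: 154
; NumAgprs: 128
; TotalNumVgprs: 284
; LDSByteSize: 0 bytes/workgroup (compile time only)
; AccumOffset: 156
; WaveLimiterHint : 0
; COMPUTE_PGM_RSRC3_GFX90A:ACCUM_OFFSET: 38
"""

info = parse_kernel_info(sample)
print(info["NumVgprs"], info["TotalNumVgprs"])
```

In the dump above, `TotalNumVgprs` (284) equals `AccumOffset` (156) plus `NumAgprs` (128); that is an observation about this sample only, but a parsed dict makes such sanity checks easy to script.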

@ravil-mobile ravil-mobile requested a review from ptillet as a code owner August 16, 2024 16:29
@antiagainst antiagainst marked this pull request as draft August 16, 2024 16:36
@zhanglx13
Collaborator

@ravil-mobile Thank you for enabling this.
Can you do a quick check on the NV side to make sure this does not break anything when printing the sass?

@ravil-mobile
Contributor Author

> @ravil-mobile Thank you for enabling this. Can you do a quick check on the NV side to make sure this does not break anything when printing the sass?

Hi @zhanglx13. I checked, and everything works for the NV backend.

@zhanglx13 zhanglx13 marked this pull request as ready for review August 22, 2024 13:16
@ThomasRaoux
Collaborator

how does this impact the ptx and sass dumped on the nv path?

@ravil-mobile
Contributor Author

> how does this impact the ptx and sass dumped on the nv path?

Hi @ThomasRaoux

Regarding PTX, it just adds comments about the source code location. For example:

        .reg .pred      %p<13>;
        .reg .b32       %r<54>;
        .reg .f32       %f<32>;
        .reg .b64       %rd<10>;
        .loc    1 15 0                          // softmax.py:15:0
$L__func_begin0:
        .loc    1 15 0                          // softmax.py:15:0

// %bb.0:
        ld.param.u64    %rd3, [softmax_kernel_param_0];
        ld.param.u32    %r20, [softmax_kernel_param_1];
$L__tmp0:
        .loc    1 16 22                         // softmax.py:16:22
        // begin inline asm
        mov.u32 %r1, %ctaid.x;
        // end inline asm
        .loc    1 17 30                         // softmax.py:17:30
        mul.lo.s32      %r21, %r1, %r20;
        ld.param.u64    %rd4, [softmax_kernel_param_2];
        ld.param.u32    %r22, [softmax_kernel_param_3];
        .loc    1 17 24                         // softmax.py:17:24
        mul.wide.s32    %rd5, %r21, 4;
        add.s64         %rd6, %rd3, %rd5;
        ld.param.u32    %r23, [softmax_kernel_param_4];
        mov.u32         %r24, %tid.x;
        and.b32         %r25, %r24, 31;
        and.b32         %r26, %r24, 63;
        .loc    1 19 58                         // softmax.py:19:58

Regarding the cubin, the comments are simply discarded, so this is not going to affect performance.

I can also change the code to enable verbose output only for the AMD backend.

Collaborator

@antiagainst antiagainst left a comment

LG to have this uniformly.

@antiagainst antiagainst merged commit 1827757 into triton-lang:main Aug 27, 2024
bertmaher pushed a commit to bertmaher/triton that referenced this pull request Dec 10, 2024
)

This PR adds verbosity to assembly code after LLVM backend passes.

This adds references to the source code for both NV and AMD.
Additionally, it adds `Kernel Info` at the end of the dump for AMD.
For example:

```
; Kernel info:
; codeLenInByte = 7732
; NumSgprs: 24
; NumVgprs: 154
; NumAgprs: 128
; TotalNumVgprs: 284
...
```