[TUTORIAL] Remove grouped gemm simulation from 09-persistent-matmul #5461

Merged

peterbell10 merged 1 commit into main from pb/tma-tutorial on Dec 19, 2024

Conversation

@peterbell10
Contributor

As discussed in the [multi-buffering PR], the persistent matmul should be kept as an apples-to-apples performance comparison. In particular, the existing perf results make the tensor-descriptor kernel look bad. With this updated tutorial I get results like (`K=4096, prec=fp8`):
```
├─ 1278.215 4731.062 cublas [M=8192, N=8192, K=4096]
│  └─ nan 4731.062 sm90_xmma_gemm_e4m3e4m3_e4m3f32_f32_tn_n_tilesize128x128x128_warpgroupsize1x1x1_bias_f16_execute_segment_k_off_kernel__5x_cublas
├─ 1208.855 454.774 matmul_kernel [M=8192, N=8192, K=4096]
├─ 1285.360 427.706 matmul_kernel_persistent [M=8192, N=8192, K=4096]
├─ 1330.667 413.143 matmul_kernel_descriptor_persistent [M=8192, N=8192, K=4096]
└─ 1347.254 408.057 matmul_kernel_tma_persistent [M=8192, N=8192, K=4096]
```

So on H100 the tensor descriptor gives a 3.5% FLOPS uplift over the plain persistent matmul, vs. 4.8% for host-side TMA.

For the same shapes with fp16 I see a 13% uplift from the tensor descriptor vs. 13.4% from host-side TMA.

[multi-buffering PR]: #5290 (comment)
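
As a quick sanity check, the quoted uplifts can be recomputed from the profile above. This assumes the first column is achieved TFLOPS (higher is better), which is an inference from the ratios matching the quoted percentages rather than something stated in the output:

```python
# Sanity check of the quoted uplifts. Assumption: the first column of the
# profile above is achieved TFLOPS (higher is better); the ratios below
# reproduce the 3.5% / 4.8% figures quoted in the text.
tflops = {
    "matmul_kernel_persistent": 1285.360,             # plain persistent matmul
    "matmul_kernel_descriptor_persistent": 1330.667,  # device-side tensor descriptor
    "matmul_kernel_tma_persistent": 1347.254,         # host-side TMA
}
base = tflops["matmul_kernel_persistent"]
for name, t in tflops.items():
    print(f"{name}: {100 * (t / base - 1):+.1f}% vs. plain persistent")
# matmul_kernel_persistent: +0.0% vs. plain persistent
# matmul_kernel_descriptor_persistent: +3.5% vs. plain persistent
# matmul_kernel_tma_persistent: +4.8% vs. plain persistent
```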
@pawelszczerbuk (Contributor) left a comment


Looks good, thanks!

@peterbell10 merged commit d1e0731 into main on Dec 19, 2024
@peterbell10 deleted the pb/tma-tutorial branch on December 19, 2024 18:15
