[TUTORIAL] Remove grouped gemm simulation from 09-persistent-matmul #5461

Merged

peterbell10 merged 1 commit into main from pb/tma-tutorial on Dec 19, 2024

Conversation

@peterbell10
Contributor

As discussed in the [multi-buffering PR], the persistent matmul should be kept as an apples-to-apples performance comparison. In particular, the existing perf results make the tensor-descriptor kernel look bad. With this updated tutorial I get results like (`K=4096, prec=fp8`):
```
├─ 1278.215 4731.062 cublas [M=8192, N=8192, K=4096]
│  └─ nan 4731.062 sm90_xmma_gemm_e4m3e4m3_e4m3f32_f32_tn_n_tilesize128x128x128_warpgroupsize1x1x1_bias_f16_execute_segment_k_off_kernel__5x_cublas
├─ 1208.855 454.774 matmul_kernel [M=8192, N=8192, K=4096]
├─ 1285.360 427.706 matmul_kernel_persistent [M=8192, N=8192, K=4096]
├─ 1330.667 413.143 matmul_kernel_descriptor_persistent [M=8192, N=8192, K=4096]
└─ 1347.254 408.057 matmul_kernel_tma_persistent [M=8192, N=8192, K=4096]
```

So on H100 the tensor descriptor gives a 3.5% FLOPS uplift over the plain persistent matmul, vs. 4.8% for host-side TMA.

For the same shapes with fp16 I see a 13% uplift from the tensor descriptor vs. 13.4% from host-side TMA.

[multi-buffering PR]: #5290 (comment)
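
As a quick sanity check, the quoted uplifts can be recomputed from the profile above. This assumes the first column is achieved TFLOPS (higher is better), which is an inference from the ratios matching the quoted percentages rather than something stated in the output:

```python
# Sanity check of the quoted uplifts. Assumption: the first column of the
# profile above is achieved TFLOPS (higher is better); the ratios below
# reproduce the 3.5% / 4.8% figures quoted in the text.
tflops = {
    "matmul_kernel_persistent": 1285.360,             # plain persistent matmul
    "matmul_kernel_descriptor_persistent": 1330.667,  # device-side tensor descriptor
    "matmul_kernel_tma_persistent": 1347.254,         # host-side TMA
}
base = tflops["matmul_kernel_persistent"]
for name, t in tflops.items():
    print(f"{name}: {100 * (t / base - 1):+.1f}% vs. plain persistent")
# matmul_kernel_persistent: +0.0% vs. plain persistent
# matmul_kernel_descriptor_persistent: +3.5% vs. plain persistent
# matmul_kernel_tma_persistent: +4.8% vs. plain persistent
```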
@pawelszczerbuk (Contributor) left a comment


Looks good, thanks!

@peterbell10 merged commit d1e0731 into main on Dec 19, 2024
@peterbell10 deleted the pb/tma-tutorial branch on December 19, 2024 18:15
