Add tensor core GQA dispatch for `[4,5,6,8]` #1258

lzhangzz · 2024-03-06T10:40:42Z

Benchmark results for various CTA_H values

A100 80G, batch size 128, sequence length 1024, without split-kv; times in microseconds

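| 32:8 | 1 | 2 | 4 |
|---|---|---|---|
| SIMT | 513.70 | 316.26 | 319.70 |
| TC | 509.66 | 318.91 | 308.35 |

| 48:8 | 1 | 2 | 3 | 6 |
|---|---|---|---|---|
| SIMT | 755.17 | 382.27 | 354.37 | 367.46\* |
| TC | 756.35 | 401.28 | 319.58 | 309.31 |

| 64:8 | 1 | 2 | 4 | 8 |
|---|---|---|---|---|
| SIMT | 1020 | 497.92 | 406.27 | 985.82\* |
| TC | 993.98 | 517.86 | 319.39 | 305.12 |

| 80:8 | 1 | 2 | 5 |
|---|---|---|---|
| SIMT | 1260 | 616 | 485.70 |
| TC | 1230 | 635.52 | 318.75 |

\* register spill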
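To make the numbers above easier to compare, here is a small Python sketch that picks the fastest CTA_H per kernel type and computes the best-SIMT over best-TC ratio. The timings are copied from the tables; the names `BENCH`, `best`, and `summarize` are illustrative and not part of the PR.

```python
# Timings in microseconds, copied from the benchmark tables above.
# Layout: head configuration -> kernel type -> {CTA_H: time}.
BENCH = {
    "32:8": {
        "SIMT": {1: 513.70, 2: 316.26, 4: 319.70},
        "TC": {1: 509.66, 2: 318.91, 4: 308.35},
    },
    "48:8": {
        "SIMT": {1: 755.17, 2: 382.27, 3: 354.37, 6: 367.46},
        "TC": {1: 756.35, 2: 401.28, 3: 319.58, 6: 309.31},
    },
    "64:8": {
        "SIMT": {1: 1020, 2: 497.92, 4: 406.27, 8: 985.82},
        "TC": {1: 993.98, 2: 517.86, 4: 319.39, 8: 305.12},
    },
    "80:8": {
        "SIMT": {1: 1260, 2: 616, 5: 485.70},
        "TC": {1: 1230, 2: 635.52, 5: 318.75},
    },
}


def best(times):
    """Return (CTA_H, time) for the fastest entry of one kernel type."""
    return min(times.items(), key=lambda item: item[1])


def summarize(bench):
    """Map each configuration to (simt_h, simt_t, tc_h, tc_t, speedup)."""
    rows = {}
    for config, kernels in bench.items():
        simt_h, simt_t = best(kernels["SIMT"])
        tc_h, tc_t = best(kernels["TC"])
        rows[config] = (simt_h, simt_t, tc_h, tc_t, simt_t / tc_t)
    return rows


if __name__ == "__main__":
    for config, (simt_h, simt_t, tc_h, tc_t, speedup) in summarize(BENCH).items():
        print(f"{config}: SIMT best CTA_H={simt_h} ({simt_t} us), "
              f"TC best CTA_H={tc_h} ({tc_t} us), speedup {speedup:.2f}x")
```

On these numbers the best TC time stays near 305 to 320 microseconds in every configuration, while the best SIMT time grows, so the ratio rises from about 1.03x at 32:8 to about 1.52x at 80:8.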

Conclusion: use TC when head_num / kv_head_num > 2
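The rule above could be sketched in Python roughly as follows. This is an illustration only, not the TurboMind dispatch code: `choose_kernel` and `TC_GROUP_SIZES` are hypothetical names, and picking the largest value from `[4,5,6,8]` that divides the group size is an assumption drawn from the PR title and the best TC columns in the benchmarks, as is the SIMT fallback when none of them fits.

```python
# Hypothetical sketch of the dispatch heuristic; not the actual TurboMind code.

# Values from the PR title, tried from largest to smallest (assumed order).
TC_GROUP_SIZES = (8, 6, 5, 4)


def choose_kernel(head_num, kv_head_num):
    """Return (kernel_type, cta_h) for a given head configuration.

    kernel_type is "TC" (tensor core) or "SIMT"; cta_h is None for SIMT
    because this sketch does not model the SIMT tiling choice.
    """
    if kv_head_num <= 0 or head_num % kv_head_num != 0:
        raise ValueError("head_num must be a positive multiple of kv_head_num")

    group_size = head_num // kv_head_num

    # Conclusion from the benchmarks: TC only pays off when the ratio exceeds 2.
    if group_size <= 2:
        return ("SIMT", None)

    # Assumption: use the largest listed value that divides the group size.
    for cta_h in TC_GROUP_SIZES:
        if group_size % cta_h == 0:
            return ("TC", cta_h)

    # Assumption: fall back to SIMT when no listed value fits (e.g. ratio 3 or 7).
    return ("SIMT", None)


if __name__ == "__main__":
    for heads in (32, 48, 64, 80):
        print(f"{heads}:8 ->", choose_kernel(heads, 8))
```

For the four benchmarked configurations this sketch selects TC with CTA_H of 4, 6, 8, and 5, which matches the fastest TC column in each table.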

zhyncs · 2024-03-06T15:40:10Z

jjjjohnson · 2024-03-07T08:39:22Z

What does TC stand for?

zhyncs · 2024-03-07T08:40:21Z

> What does TC stand for?

Tensor core.

tensor core GQA dispatch for [4,5,6,8]

b85f8f3

lzhangzz added the enhancement and turbomind labels Mar 6, 2024

lvhan028 approved these changes Mar 11, 2024

lvhan028 merged commit 331858b into InternLM:turbomind-2.1 Mar 11, 2024
9 checks passed

lvhan028 mentioned this pull request Jun 21, 2024

[Docs] Is the throughput improvement mainly due to the rewritten GQA kernel? #1785

Open