[gluon][examples] MoE bmm1 in Gluon#10047
Merged
lezcano
reviewed
Apr 16, 2026
ThomasRaoux
approved these changes
Apr 17, 2026
raymondtay
pushed a commit
to raymondtay/triton
that referenced
this pull request
Apr 18, 2026
The main beneficial optimizations are:
1. Separate loaders for weights and weight scales, allowing asymmetric
pipelining. Potentially we could have a separate partition for the scale
factor load as well, but I didn't experiment with it.
2. Epilogue optimizations to decrease the number of instructions.
Epilogue instruction issue (especially with 8 warps) starves the other
warps of instruction issue slots. Optimizing it improves overall
performance. I repurposed the 2 idle leftover warps as a store partition
to shorten the critical path in the epilogue.
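As a rough host-side analogy for point 1 (this is not Gluon code and not the kernel from this PR), the sketch below gives weights and scales their own loader and their own buffer depth, so the two pipelines can run ahead of the consumer by different amounts. The names `WEIGHT_STAGES`, `SCALE_STAGES`, `loader`, `consumer`, and `run` are hypothetical, and the depths are made-up values chosen only for illustration.

```python
import queue
import threading

NUM_TILES = 8
WEIGHT_STAGES = 4  # hypothetical pipeline depth for weight tiles
SCALE_STAGES = 2   # hypothetical, shallower depth for scale tiles


def loader(name, out_q, num_tiles):
    # Each loader fills its own bounded buffer; put() blocks when it is full,
    # so each stream runs ahead of the consumer by at most its own depth.
    for i in range(num_tiles):
        out_q.put((name, i))


def consumer(weight_q, scale_q, num_tiles, results):
    # Stand-in for the compute side: one weight tile and one scale tile per step.
    for _ in range(num_tiles):
        _, w_idx = weight_q.get()
        _, s_idx = scale_q.get()
        results.append((w_idx, s_idx))


def run():
    weight_q = queue.Queue(maxsize=WEIGHT_STAGES)
    scale_q = queue.Queue(maxsize=SCALE_STAGES)
    results = []
    threads = [
        threading.Thread(target=loader, args=("weight", weight_q, NUM_TILES)),
        threading.Thread(target=loader, args=("scale", scale_q, NUM_TILES)),
        threading.Thread(target=consumer, args=(weight_q, scale_q, NUM_TILES, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run())
```

With a single shared loader, both streams would be forced to the same depth; splitting them is what lets the depths differ.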
Performance on gpt-oss-120b shapes vs. `triton_kernels.matmul` on
synthetic "realistic" logits:
```
GPT-OSS-120B MoE MM1 E=128 EP=4 ES=8 B=2880x5760
Peak: 5 PFLOPS, 8 TBPS
batch_size example reference
----------------------------------------------------------------------------------------------------------------
128 29.12 TFLOPS ( 0.6%) 5.27 TBPS ( 65.9%) 22.74 TFLOPS ( 0.5%) 4.11 TBPS ( 51.4%)
160 36.67 TFLOPS ( 0.7%) 5.96 TBPS ( 74.5%) 27.66 TFLOPS ( 0.6%) 4.49 TBPS ( 56.2%)
192 39.57 TFLOPS ( 0.8%) 6.00 TBPS ( 75.0%) 28.94 TFLOPS ( 0.6%) 4.39 TBPS ( 54.9%)
224 52.64 TFLOPS ( 1.1%) 5.99 TBPS ( 74.9%) 39.21 TFLOPS ( 0.8%) 4.46 TBPS ( 55.8%)
256 55.48 TFLOPS ( 1.1%) 6.46 TBPS ( 80.7%) 40.17 TFLOPS ( 0.8%) 4.67 TBPS ( 58.4%)
320 67.89 TFLOPS ( 1.4%) 6.64 TBPS ( 83.0%) 41.19 TFLOPS ( 0.8%) 4.03 TBPS ( 50.4%)
384 73.96 TFLOPS ( 1.5%) 6.43 TBPS ( 80.4%) 45.74 TFLOPS ( 0.9%) 3.98 TBPS ( 49.7%)
448 91.18 TFLOPS ( 1.8%) 6.64 TBPS ( 83.0%) 53.82 TFLOPS ( 1.1%) 3.92 TBPS ( 49.0%)
512 78.16 TFLOPS ( 1.6%) 5.36 TBPS ( 67.0%) 60.78 TFLOPS ( 1.2%) 4.17 TBPS ( 52.1%)
640 90.66 TFLOPS ( 1.8%) 5.17 TBPS ( 64.6%) 58.41 TFLOPS ( 1.2%) 3.33 TBPS ( 41.6%)
768 93.65 TFLOPS ( 1.9%) 4.98 TBPS ( 62.3%) 61.17 TFLOPS ( 1.2%) 3.25 TBPS ( 40.7%)
896 103.66 TFLOPS ( 2.1%) 5.04 TBPS ( 63.1%) 66.78 TFLOPS ( 1.3%) 3.25 TBPS ( 40.6%)
1024 135.60 TFLOPS ( 2.7%) 5.16 TBPS ( 64.5%) 86.31 TFLOPS ( 1.7%) 3.28 TBPS ( 41.0%)
1280 131.21 TFLOPS ( 2.6%) 3.80 TBPS ( 47.5%) 119.41 TFLOPS ( 2.4%) 3.46 TBPS ( 43.3%)
1536 175.65 TFLOPS ( 3.5%) 4.26 TBPS ( 53.3%) 145.39 TFLOPS ( 2.9%) 3.53 TBPS ( 44.1%)
1792 181.42 TFLOPS ( 3.6%) 3.85 TBPS ( 48.1%) 148.73 TFLOPS ( 3.0%) 3.15 TBPS ( 39.4%)
2048 188.32 TFLOPS ( 3.8%) 3.96 TBPS ( 49.5%) 165.47 TFLOPS ( 3.3%) 3.48 TBPS ( 43.5%)
2560 205.57 TFLOPS ( 4.1%) 3.45 TBPS ( 43.2%) 177.60 TFLOPS ( 3.6%) 2.98 TBPS ( 37.3%)
3072 263.31 TFLOPS ( 5.3%) 3.62 TBPS ( 45.2%) 253.05 TFLOPS ( 5.1%) 3.48 TBPS ( 43.5%)
3584 283.21 TFLOPS ( 5.7%) 3.38 TBPS ( 42.2%) 290.26 TFLOPS ( 5.8%) 3.46 TBPS ( 43.2%)
4096 348.60 TFLOPS ( 7.0%) 3.61 TBPS ( 45.2%) 329.95 TFLOPS ( 6.6%) 3.42 TBPS ( 42.8%)
5120 421.37 TFLOPS ( 8.4%) 3.57 TBPS ( 44.6%) 385.63 TFLOPS ( 7.7%) 3.27 TBPS ( 40.8%)
6144 524.65 TFLOPS ( 10.5%) 3.69 TBPS ( 46.1%) 473.72 TFLOPS ( 9.5%) 3.33 TBPS ( 41.6%)
7168 626.20 TFLOPS ( 12.5%) 3.79 TBPS ( 47.4%) 582.67 TFLOPS ( 11.7%) 3.53 TBPS ( 44.1%)
8192 675.19 TFLOPS ( 13.5%) 3.64 TBPS ( 45.6%) 629.79 TFLOPS ( 12.6%) 3.40 TBPS ( 42.5%)
9216 729.35 TFLOPS ( 14.6%) 3.62 TBPS ( 45.2%) 674.98 TFLOPS ( 13.5%) 3.35 TBPS ( 41.9%)
10240 749.61 TFLOPS ( 15.0%) 3.40 TBPS ( 42.4%) 685.04 TFLOPS ( 13.7%) 3.10 TBPS ( 38.8%)
11264 806.46 TFLOPS ( 16.1%) 3.34 TBPS ( 41.8%) 721.29 TFLOPS ( 14.4%) 2.99 TBPS ( 37.3%)
12288 895.78 TFLOPS ( 17.9%) 3.38 TBPS ( 42.3%) 805.38 TFLOPS ( 16.1%) 3.04 TBPS ( 38.0%)
13312 1008.57 TFLOPS ( 20.2%) 3.53 TBPS ( 44.1%) 921.30 TFLOPS ( 18.4%) 3.22 TBPS ( 40.3%)
14336 1001.71 TFLOPS ( 20.0%) 3.24 TBPS ( 40.5%) 915.14 TFLOPS ( 18.3%) 2.96 TBPS ( 37.0%)
15360 1108.54 TFLOPS ( 22.2%) 3.35 TBPS ( 41.8%) 1020.08 TFLOPS ( 20.4%) 3.08 TBPS ( 38.5%)
16384 1182.28 TFLOPS ( 23.6%) 3.41 TBPS ( 42.6%) 1083.60 TFLOPS ( 21.7%) 3.12 TBPS ( 39.0%)
17408 1255.35 TFLOPS ( 25.1%) 3.44 TBPS ( 43.0%) 1106.65 TFLOPS ( 22.1%) 3.03 TBPS ( 37.9%)
18432 1339.98 TFLOPS ( 26.8%) 3.50 TBPS ( 43.7%) 1188.63 TFLOPS ( 23.8%) 3.10 TBPS ( 38.8%)
19456 1411.14 TFLOPS ( 28.2%) 3.49 TBPS ( 43.6%) 1239.33 TFLOPS ( 24.8%) 3.06 TBPS ( 38.3%)
20480 1465.49 TFLOPS ( 29.3%) 3.46 TBPS ( 43.3%) 1329.22 TFLOPS ( 26.6%) 3.14 TBPS ( 39.2%)
21504 1429.77 TFLOPS ( 28.6%) 3.23 TBPS ( 40.3%) 1228.47 TFLOPS ( 24.6%) 2.77 TBPS ( 34.7%)
22528 1432.90 TFLOPS ( 28.7%) 3.08 TBPS ( 38.5%) 1283.04 TFLOPS ( 25.7%) 2.76 TBPS ( 34.5%)
23552 1530.40 TFLOPS ( 30.6%) 3.17 TBPS ( 39.6%) 1304.38 TFLOPS ( 26.1%) 2.70 TBPS ( 33.8%)
24576 1531.43 TFLOPS ( 30.6%) 3.05 TBPS ( 38.1%) 1378.61 TFLOPS ( 27.6%) 2.74 TBPS ( 34.3%)
25600 1609.22 TFLOPS ( 32.2%) 3.09 TBPS ( 38.7%) 1403.54 TFLOPS ( 28.1%) 2.70 TBPS ( 33.7%)
26624 1591.86 TFLOPS ( 31.8%) 2.96 TBPS ( 37.0%) 1461.78 TFLOPS ( 29.2%) 2.72 TBPS ( 34.0%)
27648 1708.82 TFLOPS ( 34.2%) 3.08 TBPS ( 38.5%) 1554.22 TFLOPS ( 31.1%) 2.80 TBPS ( 35.0%)
28672 1715.65 TFLOPS ( 34.3%) 3.00 TBPS ( 37.5%) 1582.07 TFLOPS ( 31.6%) 2.77 TBPS ( 34.6%)
29696 1756.95 TFLOPS ( 35.1%) 2.98 TBPS ( 37.3%) 1562.01 TFLOPS ( 31.2%) 2.65 TBPS ( 33.1%)
30720 1801.20 TFLOPS ( 36.0%) 2.97 TBPS ( 37.2%) 1681.00 TFLOPS ( 33.6%) 2.77 TBPS ( 34.7%)
31744 1979.10 TFLOPS ( 39.6%) 3.17 TBPS ( 39.7%) 1754.99 TFLOPS ( 35.1%) 2.82 TBPS ( 35.2%)
```
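For reading the table, the percentages in parentheses appear to be each measurement divided by the stated peak (5 PFLOPS, 8 TBPS). A minimal sketch of that arithmetic follows; the function name `utilization` is hypothetical, and recomputing from the already-rounded table values can differ from the printed percentage in the last digit.

```python
PEAK_TFLOPS = 5000.0  # 5 PFLOPS, from the "Peak:" line of the table
PEAK_TBPS = 8.0       # 8 TBPS, from the same line


def utilization(tflops, tbps):
    """Return (compute %, bandwidth %) of peak, rounded to one decimal."""
    compute_pct = round(100.0 * tflops / PEAK_TFLOPS, 1)
    bandwidth_pct = round(100.0 * tbps / PEAK_TBPS, 1)
    return compute_pct, bandwidth_pct


if __name__ == "__main__":
    # "example" columns of the batch_size=192 row: 39.57 TFLOPS, 6.00 TBPS
    print(utilization(39.57, 6.00))
```

For the batch_size=192 row this reproduces the 0.8% and 75.0% printed in the table.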
And on uniform logits:
```
batch_size example reference
----------------------------------------------------------------------------------------------------------------
128 73.67 TFLOPS ( 1.5%) 5.41 TBPS ( 67.7%) 60.70 TFLOPS ( 1.2%) 4.46 TBPS ( 55.7%)
160 96.46 TFLOPS ( 1.9%) 5.50 TBPS ( 68.7%) 78.10 TFLOPS ( 1.6%) 4.45 TBPS ( 55.6%)
192 115.43 TFLOPS ( 2.3%) 5.43 TBPS ( 67.9%) 94.71 TFLOPS ( 1.9%) 4.45 TBPS ( 55.7%)
224 141.31 TFLOPS ( 2.8%) 5.47 TBPS ( 68.3%) 115.26 TFLOPS ( 2.3%) 4.46 TBPS ( 55.7%)
256 155.77 TFLOPS ( 3.1%) 5.47 TBPS ( 68.4%) 126.31 TFLOPS ( 2.5%) 4.44 TBPS ( 55.5%)
320 205.17 TFLOPS ( 4.1%) 5.46 TBPS ( 68.2%) 161.57 TFLOPS ( 3.2%) 4.30 TBPS ( 53.7%)
384 251.10 TFLOPS ( 5.0%) 5.50 TBPS ( 68.8%) 195.10 TFLOPS ( 3.9%) 4.28 TBPS ( 53.5%)
448 295.50 TFLOPS ( 5.9%) 5.53 TBPS ( 69.2%) 228.63 TFLOPS ( 4.6%) 4.28 TBPS ( 53.5%)
512 316.07 TFLOPS ( 6.3%) 5.34 TBPS ( 66.7%) 254.07 TFLOPS ( 5.1%) 4.29 TBPS ( 53.6%)
640 414.90 TFLOPS ( 8.3%) 5.43 TBPS ( 67.9%) 293.48 TFLOPS ( 5.9%) 3.84 TBPS ( 48.0%)
768 489.21 TFLOPS ( 9.8%) 5.43 TBPS ( 67.9%) 350.19 TFLOPS ( 7.0%) 3.89 TBPS ( 48.6%)
896 553.62 TFLOPS ( 11.1%) 5.39 TBPS ( 67.4%) 405.50 TFLOPS ( 8.1%) 3.95 TBPS ( 49.4%)
1024 576.71 TFLOPS ( 11.5%) 4.92 TBPS ( 61.5%) 463.70 TFLOPS ( 9.3%) 3.95 TBPS ( 49.4%)
1280 682.95 TFLOPS ( 13.7%) 4.76 TBPS ( 59.5%) 571.87 TFLOPS ( 11.4%) 3.98 TBPS ( 49.8%)
1536 837.25 TFLOPS ( 16.7%) 4.99 TBPS ( 62.4%) 679.20 TFLOPS ( 13.6%) 4.05 TBPS ( 50.6%)
1792 934.09 TFLOPS ( 18.7%) 4.84 TBPS ( 60.5%) 686.46 TFLOPS ( 13.7%) 3.56 TBPS ( 44.5%)
2048 937.04 TFLOPS ( 18.7%) 4.25 TBPS ( 53.2%) 714.85 TFLOPS ( 14.3%) 3.24 TBPS ( 40.6%)
2560 1081.82 TFLOPS ( 21.6%) 4.07 TBPS ( 50.9%) 1020.85 TFLOPS ( 20.4%) 3.84 TBPS ( 48.0%)
3072 1313.97 TFLOPS ( 26.3%) 4.21 TBPS ( 52.7%) 1209.26 TFLOPS ( 24.2%) 3.88 TBPS ( 48.5%)
3584 1533.25 TFLOPS ( 30.7%) 4.27 TBPS ( 53.3%) 1410.67 TFLOPS ( 28.2%) 3.93 TBPS ( 49.1%)
4096 1399.52 TFLOPS ( 28.0%) 3.42 TBPS ( 42.7%) 1245.96 TFLOPS ( 24.9%) 3.04 TBPS ( 38.1%)
5120 1489.01 TFLOPS ( 29.8%) 2.97 TBPS ( 37.1%) 1320.69 TFLOPS ( 26.4%) 2.63 TBPS ( 32.9%)
6144 1901.82 TFLOPS ( 38.0%) 3.21 TBPS ( 40.2%) 1638.02 TFLOPS ( 32.8%) 2.77 TBPS ( 34.6%)
7168 2234.87 TFLOPS ( 44.7%) 3.28 TBPS ( 41.0%) 1987.25 TFLOPS ( 39.7%) 2.92 TBPS ( 36.5%)
8192 1999.71 TFLOPS ( 40.0%) 2.59 TBPS ( 32.4%) 1747.04 TFLOPS ( 34.9%) 2.26 TBPS ( 28.3%)
9216 2024.82 TFLOPS ( 40.5%) 2.38 TBPS ( 29.7%) 1802.45 TFLOPS ( 36.0%) 2.12 TBPS ( 26.5%)
10240 2233.54 TFLOPS ( 44.7%) 2.38 TBPS ( 29.8%) 1987.52 TFLOPS ( 39.8%) 2.12 TBPS ( 26.5%)
11264 2423.11 TFLOPS ( 48.5%) 2.40 TBPS ( 30.0%) 2163.00 TFLOPS ( 43.3%) 2.14 TBPS ( 26.8%)
12288 2230.25 TFLOPS ( 44.6%) 2.05 TBPS ( 25.7%) 1971.60 TFLOPS ( 39.4%) 1.82 TBPS ( 22.7%)
13312 2386.76 TFLOPS ( 47.7%) 2.07 TBPS ( 25.9%) 2089.59 TFLOPS ( 41.8%) 1.81 TBPS ( 22.7%)
14336 2535.85 TFLOPS ( 50.7%) 2.08 TBPS ( 25.9%) 2214.30 TFLOPS ( 44.3%) 1.81 TBPS ( 22.7%)
15360 2708.63 TFLOPS ( 54.2%) 2.10 TBPS ( 26.3%) 2371.09 TFLOPS ( 47.4%) 1.84 TBPS ( 23.0%)
16384 2495.78 TFLOPS ( 49.9%) 1.85 TBPS ( 23.1%) 2256.94 TFLOPS ( 45.1%) 1.67 TBPS ( 20.9%)
17408 2613.94 TFLOPS ( 52.3%) 1.85 TBPS ( 23.1%) 2350.97 TFLOPS ( 47.0%) 1.66 TBPS ( 20.8%)
18432 2600.95 TFLOPS ( 52.0%) 1.76 TBPS ( 22.0%) 2376.24 TFLOPS ( 47.5%) 1.61 TBPS ( 20.1%)
19456 2720.43 TFLOPS ( 54.4%) 1.77 TBPS ( 22.1%) 2445.94 TFLOPS ( 48.9%) 1.59 TBPS ( 19.9%)
20480 2694.19 TFLOPS ( 53.9%) 1.68 TBPS ( 21.0%) 2418.54 TFLOPS ( 48.4%) 1.51 TBPS ( 18.9%)
21504 2667.33 TFLOPS ( 53.3%) 1.61 TBPS ( 20.1%) 2432.76 TFLOPS ( 48.7%) 1.47 TBPS ( 18.4%)
22528 2773.18 TFLOPS ( 55.5%) 1.62 TBPS ( 20.3%) 2531.76 TFLOPS ( 50.6%) 1.48 TBPS ( 18.5%)
23552 2867.55 TFLOPS ( 57.4%) 1.62 TBPS ( 20.3%) 2604.20 TFLOPS ( 52.1%) 1.48 TBPS ( 18.4%)
24576 2736.19 TFLOPS ( 54.7%) 1.50 TBPS ( 18.8%) 2499.52 TFLOPS ( 50.0%) 1.37 TBPS ( 17.2%)
25600 2789.45 TFLOPS ( 55.8%) 1.49 TBPS ( 18.6%) 2554.81 TFLOPS ( 51.1%) 1.37 TBPS ( 17.1%)
26624 2882.35 TFLOPS ( 57.6%) 1.50 TBPS ( 18.7%) 2669.93 TFLOPS ( 53.4%) 1.39 TBPS ( 17.4%)
27648 2856.47 TFLOPS ( 57.1%) 1.45 TBPS ( 18.1%) 2624.29 TFLOPS ( 52.5%) 1.33 TBPS ( 16.7%)
28672 2851.52 TFLOPS ( 57.0%) 1.41 TBPS ( 17.7%) 2621.20 TFLOPS ( 52.4%) 1.30 TBPS ( 16.2%)
29696 2826.90 TFLOPS ( 56.5%) 1.37 TBPS ( 17.1%) 2483.67 TFLOPS ( 49.7%) 1.20 TBPS ( 15.1%)
30720 2894.51 TFLOPS ( 57.9%) 1.37 TBPS ( 17.2%) 2527.41 TFLOPS ( 50.5%) 1.20 TBPS ( 15.0%)
31744 2875.63 TFLOPS ( 57.5%) 1.34 TBPS ( 16.7%) 2505.68 TFLOPS ( 50.1%) 1.16 TBPS ( 14.5%)
```
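To compare the two columns directly, one can parse a row and take the ratio of the example throughput to the reference throughput. This is a small sketch, assuming rows keep the `<value> TFLOPS` / `<value> TBPS` layout shown above; `parse_row` and `speedup` are hypothetical helper names, not part of the benchmark script.

```python
import re

# Matches "<number> TFLOPS" or "<number> TBPS"; the "( 1.5%)" fields are skipped
# because they are not followed by a unit.
_VALUE = re.compile(r"([\d.]+)\s+(TFLOPS|TBPS)")


def parse_row(row):
    """Split a table row into batch size and the four measured values."""
    batch_size = int(row.split()[0])
    values = [float(v) for v, _unit in _VALUE.findall(row)]
    ex_tflops, ex_tbps, ref_tflops, ref_tbps = values
    return {
        "batch_size": batch_size,
        "example": (ex_tflops, ex_tbps),
        "reference": (ref_tflops, ref_tbps),
    }


def speedup(row):
    """Ratio of example TFLOPS to reference TFLOPS for one row."""
    parsed = parse_row(row)
    return parsed["example"][0] / parsed["reference"][0]


if __name__ == "__main__":
    row = "128 73.67 TFLOPS ( 1.5%) 5.41 TBPS ( 67.7%) 60.70 TFLOPS ( 1.2%) 4.46 TBPS ( 55.7%)"
    print(f"{speedup(row):.2f}x")
```

A ratio below 1.0 means the reference was faster for that row, as happens at batch_size=3584 in the "realistic" logits table.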
bingyizh233
pushed a commit
to bingyizh233/triton
that referenced
this pull request
Apr 20, 2026