
[Triton] Bench mha dao_ai impl #2542

Merged

brunomazzottiamd merged 2 commits into main from micmelesse/bench_daoai on Apr 9, 2026

Conversation

@micmelesse
Contributor

@micmelesse micmelesse commented Mar 31, 2026

Motivation

This PR benchmarks the flash_attention_triton_amd code (the backend used in upstream flash attention) with bench_mha.py. It adds mha_set_impl("dao_ai") to dispatch forward/backward through the flash_attn_2 codepath and refactors bench_mha.py. Tests in test_mha.py now parametrize over both the default and dao_ai implementations.
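
A minimal, self-contained sketch of the kind of global impl switch described above; only mha_set_impl("dao_ai") and the routing to the flash_attn_2 codepath come from this PR, while the internal names and the use of PyTorch SDPA as a stand-in for both paths are illustrative assumptions:

```python
# Sketch only: a module-level switch that routes the MHA forward path.
# mha_set_impl("dao_ai") is the API named in this PR; everything else here
# (function names, the SDPA stand-ins) is assumed for illustration.
import torch
import torch.nn.functional as F

_MHA_IMPL = "default"  # module-level state consulted at dispatch time


def mha_set_impl(impl: str) -> None:
    """Select which MHA forward implementation subsequent calls use."""
    global _MHA_IMPL
    if impl not in ("default", "dao_ai"):
        raise ValueError(f"unknown impl: {impl}")
    _MHA_IMPL = impl


def flash_attn_forward(q, k, v, causal=False):
    if _MHA_IMPL == "dao_ai":
        # The real code would route through the flash_attn_2 codepath in
        # flash_attention_triton_amd; SDPA is only a placeholder here.
        return F.scaled_dot_product_attention(q, k, v, is_causal=causal)
    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)


if __name__ == "__main__":
    q = k = v = torch.randn(2, 8, 128, 64)
    mha_set_impl("dao_ai")
    print(flash_attn_forward(q, k, v, causal=True).shape)
```

With a switch like this, test_mha.py can parametrize over ("default", "dao_ai") and call the setter once per case.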

Technical Details

Test Plan

Test Result

Submission Checklist

@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

  • ci:triton-355: Run Triton tests on MI355 in addition to MI325
  • ci:sglang: SGLang integration tests
  • ci:atom: ATOM benchmark (DeepSeek-R1 + GPT-OSS)
  • ci:vllm: vLLM benchmark
  • ci:all: All of the above

Add labels via the sidebar or gh pr edit 2542 --add-label <label>

@micmelesse micmelesse force-pushed the micmelesse/bench_daoai branch 5 times, most recently from 112f149 to 7a6f68f Compare March 31, 2026 02:55
@micmelesse micmelesse marked this pull request as ready for review March 31, 2026 02:56
@micmelesse micmelesse requested review from a team and Copilot March 31, 2026 02:56
Contributor

Copilot AI left a comment

Pull request overview

Adds a selectable “dao_ai” MHA forward implementation (backed by flash_attn_triton_amd / flash_attn_2) and wires it into the existing Triton MHA benchmark script and the flash-attention integration workflow so CI can produce benchmark artifacts.

Changes:

  • Add -impl {default,dao_ai} to bench_mha.py, expand benchmark config grids (including GQA + causal axis), and annotate provider labels with the impl (a rough sketch follows this list).
  • Introduce a global MHA forward-impl switch in aiter/ops/triton/attention/mha.py that routes forward to flash_attn_2 when selected.
  • Extend .github/workflows/flash_attention_integration.yaml to run and upload dao_ai kernel + model benchmarks (and rename existing benchmark log artifacts).
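
A rough sketch of the -impl wiring mentioned in the first item above; the flag name and choices come from the review summary, while the config grid, helper names, and provider labelling below are assumptions about what bench_mha.py might look like, not its actual contents:

```python
# Illustrative only: wiring an -impl {default,dao_ai} choice into a benchmark
# grid and tagging provider labels with it so result columns stay distinct.
import argparse
import itertools


def parse_args():
    p = argparse.ArgumentParser(description="MHA benchmark (sketch)")
    p.add_argument("-impl", choices=["default", "dao_ai"], default="default",
                   help="which MHA forward implementation to benchmark")
    return p.parse_args()


def build_configs(impl):
    # Expand the grid over GQA head ratios, sequence lengths, and a causal axis.
    heads = [(32, 8), (64, 8), (128, 128)]   # (HQ, HK) pairs, made up here
    seqlens = [256, 1024, 8192]
    for (hq, hk), n_ctx, causal in itertools.product(heads, seqlens, (True, False)):
        yield {"HQ": hq, "HK": hk, "N_CTX": n_ctx, "causal": causal,
               "provider": f"triton-{impl}"}


if __name__ == "__main__":
    args = parse_args()
    for cfg in build_configs(args.impl):
        print(cfg)
```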

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

  • op_tests/op_benchmarks/triton/bench_mha.py: Adds -impl flag, includes causal as a benchmark dimension, expands head/seq config coverage, and updates provider labeling.
  • aiter/ops/triton/attention/mha.py: Adds global impl selection and routes forward to flash_attn_2 for dao_ai.
  • .github/workflows/flash_attention_integration.yaml: Runs dao_ai benchmarks in CI and uploads logs/CSVs as artifacts.


Comment thread aiter/ops/triton/attention/mha.py
Comment thread aiter/ops/triton/attention/mha.py Outdated
Comment thread aiter/ops/triton/attention/mha.py
Comment thread .github/workflows/flash_attention_integration.yaml Outdated
Comment thread op_tests/op_benchmarks/triton/bench_mha.py Outdated
brunomazzottiamd

This comment was marked as resolved.

@micmelesse micmelesse force-pushed the micmelesse/bench_daoai branch 2 times, most recently from 56a638e to 1d85ccf Compare March 31, 2026 15:35
@micmelesse micmelesse marked this pull request as draft March 31, 2026 17:19
@micmelesse

This comment was marked as resolved.

@micmelesse micmelesse force-pushed the micmelesse/bench_daoai branch 2 times, most recently from 055223f to 07d864a Compare April 1, 2026 15:39
@micmelesse micmelesse marked this pull request as ready for review April 1, 2026 15:39
@micmelesse micmelesse force-pushed the micmelesse/bench_daoai branch 10 times, most recently from 9ae8dd5 to ae5999b Compare April 7, 2026 14:47
@micmelesse micmelesse force-pushed the micmelesse/bench_daoai branch 4 times, most recently from 9c6caf4 to 48a086e Compare April 7, 2026 23:45
brunomazzottiamd

This comment was marked as resolved.

add impl bench

simple config list

edge cases

lint

more configs

separate scenario

add batch config

split branches

print loop

limit configs

add literal

scenarios

workload function

try catch

better print

csv writer

add skip

save

save

try free

arch check

again

save

save

try again

save

fp8 skip

save

save

lint
@micmelesse micmelesse force-pushed the micmelesse/bench_daoai branch from 48a086e to 6b4b934 Compare April 8, 2026 18:44
@micmelesse micmelesse force-pushed the micmelesse/bench_daoai branch from 6b4b934 to 15b22ff Compare April 8, 2026 18:56
Contributor

@brunomazzottiamd brunomazzottiamd left a comment

LGTM! Let's wait for the CI and then merge it.

@brunomazzottiamd
Contributor

The following failures on Triton Tests (MI35X) / Shard 0 are expected:

=========================== short test summary info ============================
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-32-512-512-1]
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-32-512-512-4]
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-32-1024-1024-1]
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-32-1024-1024-4]
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-32-2048-2048-1]
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-32-2048-2048-4]
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-64-512-512-1]
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-64-512-512-4]
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-64-1024-1024-1]
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-64-1024-1024-4]
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-64-2048-2048-1]
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-64-2048-2048-4]
==== 12 failed, 5269 passed, 2016 skipped, 6 warnings in 4842.04s (1:20:42) ====

Everything else has passed; we're good to merge.
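
For local triage, any one of the listed parametrizations can be rerun in isolation by its node ID (assuming the repo and its test dependencies are set up on a suitable GPU; the bracketed ID below is copied verbatim from the summary above):

```python
# Re-run a single failing parametrization from the summary above.
import pytest

node_id = (
    "op_tests/triton_tests/attention/test_mha.py::"
    "test_mha_backward[True-False-0.0-True-128-8-32-512-512-1]"
)
raise SystemExit(pytest.main(["-q", node_id]))
```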

@brunomazzottiamd brunomazzottiamd merged commit 6c37a55 into main Apr 9, 2026
105 of 114 checks passed
@brunomazzottiamd brunomazzottiamd deleted the micmelesse/bench_daoai branch April 9, 2026 12:37
@micmelesse
Contributor Author

micmelesse commented Apr 9, 2026

The bench results on MI35X and RDNA3 are at https://github.com/ROCm/aiter/actions/runs/24152934159/job/70485243064. I have posted the data below.
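
For reading the TFLOPS columns below, the conventional attention FLOP model is a useful reference point; this is an assumption about the benchmark's accounting (bench_mha.py may count FLOPs differently), not something stated in this PR:

```python
# Rough FLOP model often used to convert attention latency into TFLOPS:
# two matmuls of shape (N_CTX_Q x D_HEAD x N_CTX_K), 2 FLOPs per MAC, halved
# under a causal mask, with backward commonly booked at ~2.5x the forward.
# Assumed for illustration; not necessarily what bench_mha.py computes.
def attn_flops(batch, hq, n_ctx_q, n_ctx_k, d_head, causal, mode="fwd"):
    flops = 4.0 * batch * hq * n_ctx_q * n_ctx_k * d_head
    if causal:
        flops *= 0.5
    if mode.startswith("bwd"):
        flops *= 2.5
    return flops


# Example row shape: llama3-8B, BATCH=8, HQ=32, 1024x1024, D_HEAD=128, causal fwd
print(attn_flops(8, 32, 1024, 1024, 128, causal=True) / 1e12, "TFLOP per pass")
```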

Here are the MI35X numbers

model BATCH HQ HK N_CTX_Q N_CTX_K D_HEAD D_HEAD_V causal function dtype impl fused TFLOPS
llama3-8B 32 32 8 256 256 128 128 True fwd bf16 dao_ai False 104.37
llama3-8B 32 32 8 256 256 128 128 True bwd bf16 dao_ai False 101.09
llama3-8B 32 32 8 256 256 128 128 True fwd_varlen bf16 dao_ai False 52.90
llama3-8B 32 32 8 256 256 128 128 True bwd_varlen bf16 dao_ai False 59.47
llama3-8B 8 32 8 1024 1024 128 128 True fwd bf16 dao_ai False 336.49
llama3-8B 8 32 8 1024 1024 128 128 True bwd bf16 dao_ai False 192.25
llama3-8B 8 32 8 1024 1024 128 128 True fwd_varlen bf16 dao_ai False 167.41
llama3-8B 8 32 8 1024 1024 128 128 True bwd_varlen bf16 dao_ai False 136.63
llama3-8B 1 32 8 8192 8192 128 128 True fwd bf16 dao_ai False 568.97
llama3-8B 1 32 8 8192 8192 128 128 True bwd bf16 dao_ai False 328.52
llama3-8B 1 32 8 8192 8192 128 128 True fwd_varlen bf16 dao_ai False 467.91
llama3-8B 1 32 8 8192 8192 128 128 True bwd_varlen bf16 dao_ai False 275.01
llama3-8B 8 32 8 1024 4096 128 128 False fwd bf16 dao_ai False 653.23
llama3-8B 8 32 8 1024 4096 128 128 False bwd bf16 dao_ai False 329.63
llama3-8B 8 32 8 1024 4096 128 128 False fwd_varlen bf16 dao_ai False 471.38
llama3-8B 8 32 8 1024 4096 128 128 False bwd_varlen bf16 dao_ai False 272.11
llama3-8B 64 32 8 1 1024 128 128 True fwd_kvcache bf16 dao_ai False 3.72
llama3-8B 64 32 8 1 4096 128 128 True fwd_kvcache bf16 dao_ai False 4.03
llama3-8B 64 32 8 1 8192 128 128 True fwd_kvcache bf16 dao_ai False 4.10
llama3-70B 32 64 8 256 256 128 128 True fwd bf16 dao_ai False 183.38
llama3-70B 32 64 8 256 256 128 128 True bwd bf16 dao_ai False 105.30
llama3-70B 32 64 8 256 256 128 128 True fwd_varlen bf16 dao_ai False 82.60
llama3-70B 32 64 8 256 256 128 128 True bwd_varlen bf16 dao_ai False 77.13
llama3-70B 8 64 8 1024 1024 128 128 True fwd bf16 dao_ai False 399.05
llama3-70B 8 64 8 1024 1024 128 128 True bwd bf16 dao_ai False 197.68
llama3-70B 8 64 8 1024 1024 128 128 True fwd_varlen bf16 dao_ai False 239.43
llama3-70B 8 64 8 1024 1024 128 128 True bwd_varlen bf16 dao_ai False 153.52
llama3-70B 1 64 8 8192 8192 128 128 True fwd bf16 dao_ai False 617.68
llama3-70B 1 64 8 8192 8192 128 128 True bwd bf16 dao_ai False 322.57
llama3-70B 1 64 8 8192 8192 128 128 True fwd_varlen bf16 dao_ai False 538.18
llama3-70B 1 64 8 8192 8192 128 128 True bwd_varlen bf16 dao_ai False 281.01
llama3-70B 8 64 8 1024 4096 128 128 False fwd bf16 dao_ai False 694.00
llama3-70B 8 64 8 1024 4096 128 128 False bwd bf16 dao_ai False 334.41
llama3-70B 8 64 8 1024 4096 128 128 False fwd_varlen bf16 dao_ai False 538.90
llama3-70B 8 64 8 1024 4096 128 128 False bwd_varlen bf16 dao_ai False 281.06
llama3-70B 64 64 8 1 1024 128 128 True fwd_kvcache bf16 dao_ai False 4.00
llama3-70B 64 64 8 1 4096 128 128 True fwd_kvcache bf16 dao_ai False 4.28
llama3-70B 64 64 8 1 8192 128 128 True fwd_kvcache bf16 dao_ai False 4.42
llama3-405B 32 128 8 256 256 128 128 True fwd bf16 dao_ai False 191.76
llama3-405B 32 128 8 256 256 128 128 True bwd bf16 dao_ai False 107.42
llama3-405B 32 128 8 256 256 128 128 True fwd_varlen bf16 dao_ai False 110.54
llama3-405B 32 128 8 256 256 128 128 True bwd_varlen bf16 dao_ai False 85.40
llama3-405B 8 128 8 1024 1024 128 128 True fwd bf16 dao_ai False 419.41
llama3-405B 8 128 8 1024 1024 128 128 True bwd bf16 dao_ai False 202.22
llama3-405B 8 128 8 1024 1024 128 128 True fwd_varlen bf16 dao_ai False 303.46
llama3-405B 8 128 8 1024 1024 128 128 True bwd_varlen bf16 dao_ai False 165.99
llama3-405B 1 128 8 8192 8192 128 128 True fwd bf16 dao_ai False 632.13
llama3-405B 1 128 8 8192 8192 128 128 True bwd bf16 dao_ai False 319.43
llama3-405B 1 128 8 8192 8192 128 128 True fwd_varlen bf16 dao_ai False 585.47
llama3-405B 1 128 8 8192 8192 128 128 True bwd_varlen bf16 dao_ai False 281.29
llama3-405B 8 128 8 1024 4096 128 128 False fwd bf16 dao_ai False 686.85
llama3-405B 8 128 8 1024 4096 128 128 False bwd bf16 dao_ai False 328.35
llama3-405B 8 128 8 1024 4096 128 128 False fwd_varlen bf16 dao_ai False 581.36
llama3-405B 8 128 8 1024 4096 128 128 False bwd_varlen bf16 dao_ai False 283.53
llama3-405B 64 128 8 1 1024 128 128 True fwd_kvcache bf16 dao_ai False 4.33
llama3-405B 64 128 8 1 4096 128 128 True fwd_kvcache bf16 dao_ai False 4.70
llama3-405B 64 128 8 1 8192 128 128 True fwd_kvcache bf16 dao_ai False 4.72
mixtral-7B 32 32 8 256 256 128 128 True fwd bf16 dao_ai False 101.64
mixtral-7B 32 32 8 256 256 128 128 True bwd bf16 dao_ai False 102.20
mixtral-7B 32 32 8 256 256 128 128 True fwd_varlen bf16 dao_ai False 51.54
mixtral-7B 32 32 8 256 256 128 128 True bwd_varlen bf16 dao_ai False 58.67
mixtral-7B 8 32 8 1024 1024 128 128 True fwd bf16 dao_ai False 336.52
mixtral-7B 8 32 8 1024 1024 128 128 True bwd bf16 dao_ai False 191.87
mixtral-7B 8 32 8 1024 1024 128 128 True fwd_varlen bf16 dao_ai False 164.68
mixtral-7B 8 32 8 1024 1024 128 128 True bwd_varlen bf16 dao_ai False 136.21
mixtral-7B 1 32 8 8192 8192 128 128 True fwd bf16 dao_ai False 568.21
mixtral-7B 1 32 8 8192 8192 128 128 True bwd bf16 dao_ai False 330.51
mixtral-7B 1 32 8 8192 8192 128 128 True fwd_varlen bf16 dao_ai False 465.56
mixtral-7B 1 32 8 8192 8192 128 128 True bwd_varlen bf16 dao_ai False 273.98
mixtral-7B 8 32 8 1024 4096 128 128 False fwd bf16 dao_ai False 652.98
mixtral-7B 8 32 8 1024 4096 128 128 False bwd bf16 dao_ai False 327.49
mixtral-7B 8 32 8 1024 4096 128 128 False fwd_varlen bf16 dao_ai False 470.39
mixtral-7B 8 32 8 1024 4096 128 128 False bwd_varlen bf16 dao_ai False 272.74
mixtral-7B 64 32 8 1 1024 128 128 True fwd_kvcache bf16 dao_ai False 3.72
mixtral-7B 64 32 8 1 4096 128 128 True fwd_kvcache bf16 dao_ai False 4.01
mixtral-7B 64 32 8 1 8192 128 128 True fwd_kvcache bf16 dao_ai False 4.10
mixtral-22B 32 48 8 256 256 128 128 True fwd bf16 dao_ai False 147.98
mixtral-22B 32 48 8 256 256 128 128 True bwd bf16 dao_ai False 102.92
mixtral-22B 32 48 8 256 256 128 128 True fwd_varlen bf16 dao_ai False 69.39
mixtral-22B 32 48 8 256 256 128 128 True bwd_varlen bf16 dao_ai False 69.28
mixtral-22B 8 48 8 1024 1024 128 128 True fwd bf16 dao_ai False 377.39
mixtral-22B 8 48 8 1024 1024 128 128 True bwd bf16 dao_ai False 190.86
mixtral-22B 8 48 8 1024 1024 128 128 True fwd_varlen bf16 dao_ai False 212.57
mixtral-22B 8 48 8 1024 1024 128 128 True bwd_varlen bf16 dao_ai False 145.01
mixtral-22B 1 48 8 8192 8192 128 128 True fwd bf16 dao_ai False 593.57
mixtral-22B 1 48 8 8192 8192 128 128 True bwd bf16 dao_ai False 327.92
mixtral-22B 1 48 8 8192 8192 128 128 True fwd_varlen bf16 dao_ai False 504.81
mixtral-22B 1 48 8 8192 8192 128 128 True bwd_varlen bf16 dao_ai False 276.84
mixtral-22B 8 48 8 1024 4096 128 128 False fwd bf16 dao_ai False 669.42
mixtral-22B 8 48 8 1024 4096 128 128 False bwd bf16 dao_ai False 329.64
mixtral-22B 8 48 8 1024 4096 128 128 False fwd_varlen bf16 dao_ai False 523.02
mixtral-22B 8 48 8 1024 4096 128 128 False bwd_varlen bf16 dao_ai False 276.83
mixtral-22B 64 48 8 1 1024 128 128 True fwd_kvcache bf16 dao_ai False 3.99
mixtral-22B 64 48 8 1 4096 128 128 True fwd_kvcache bf16 dao_ai False 4.28
mixtral-22B 64 48 8 1 8192 128 128 True fwd_kvcache bf16 dao_ai False 4.47
deepseek-V3 32 128 128 256 256 56 56 True fwd bf16 dao_ai False 96.47
deepseek-V3 32 128 128 256 256 56 56 True bwd bf16 dao_ai False 42.08
deepseek-V3 32 128 128 256 256 56 56 True fwd_varlen bf16 dao_ai False 53.76
deepseek-V3 32 128 128 256 256 56 56 True bwd_varlen bf16 dao_ai False 33.61
deepseek-V3 8 128 128 1024 1024 56 56 True fwd bf16 dao_ai False 197.39
deepseek-V3 8 128 128 1024 1024 56 56 True bwd bf16 dao_ai False 103.60
deepseek-V3 8 128 128 1024 1024 56 56 True fwd_varlen bf16 dao_ai False 142.15
deepseek-V3 8 128 128 1024 1024 56 56 True bwd_varlen bf16 dao_ai False 87.53
deepseek-V3 1 128 128 8192 8192 56 56 True fwd bf16 dao_ai False 302.74
deepseek-V3 1 128 128 8192 8192 56 56 True bwd bf16 dao_ai False 174.83
deepseek-V3 1 128 128 8192 8192 56 56 True fwd_varlen bf16 dao_ai False 264.71
deepseek-V3 1 128 128 8192 8192 56 56 True bwd_varlen bf16 dao_ai False 150.80
deepseek-V3 8 128 128 1024 4096 56 56 False fwd bf16 dao_ai False 335.64
deepseek-V3 8 128 128 1024 4096 56 56 False bwd bf16 dao_ai False 223.76
deepseek-V3 8 128 128 1024 4096 56 56 False fwd_varlen bf16 dao_ai False 292.88
deepseek-V3 8 128 128 1024 4096 56 56 False bwd_varlen bf16 dao_ai False 207.64
deepseek-V3 64 128 128 1 1024 56 56 True fwd_kvcache bf16 dao_ai False 1.66
deepseek-V3 64 128 128 1 4096 56 56 True fwd_kvcache bf16 dao_ai False 1.79
deepseek-V3 64 128 128 1 8192 56 56 True fwd_kvcache bf16 dao_ai False 1.81

Here are the RDNA3 numbers

model BATCH HQ HK N_CTX_Q N_CTX_K D_HEAD D_HEAD_V causal function dtype impl fused TFLOPS
llama3-8B 32 32 8 256 256 128 128 True fwd bf16 dao_ai False 14.75
llama3-8B 32 32 8 256 256 128 128 True bwd bf16 dao_ai False 1.74
llama3-8B 32 32 8 256 256 128 128 True fwd_varlen bf16 dao_ai False 9.65
llama3-8B 32 32 8 256 256 128 128 True bwd_varlen bf16 dao_ai False 1.51
llama3-8B 8 32 8 1024 1024 128 128 True fwd bf16 dao_ai False 33.82
llama3-8B 8 32 8 1024 1024 128 128 True bwd bf16 dao_ai False 2.50
llama3-8B 8 32 8 1024 1024 128 128 True fwd_varlen bf16 dao_ai False 24.46
llama3-8B 8 32 8 1024 1024 128 128 True bwd_varlen bf16 dao_ai False 2.38
llama3-8B 1 32 8 8192 8192 128 128 True fwd bf16 dao_ai False 49.09
llama3-8B 1 32 8 8192 8192 128 128 True bwd bf16 dao_ai False 2.67
llama3-8B 1 32 8 8192 8192 128 128 True fwd_varlen bf16 dao_ai False 42.25
llama3-8B 1 32 8 8192 8192 128 128 True bwd_varlen bf16 dao_ai False 2.58
llama3-8B 8 32 8 1024 4096 128 128 False fwd bf16 dao_ai False 51.93
llama3-8B 8 32 8 1024 4096 128 128 False bwd bf16 dao_ai False 3.16
llama3-8B 8 32 8 1024 4096 128 128 False fwd_varlen bf16 dao_ai False 45.11
llama3-8B 8 32 8 1024 4096 128 128 False bwd_varlen bf16 dao_ai False 3.11
llama3-8B 64 32 8 1 1024 128 128 True fwd_kvcache bf16 dao_ai False 0.86
llama3-8B 64 32 8 1 4096 128 128 True fwd_kvcache bf16 dao_ai False 0.89
llama3-8B 64 32 8 1 8192 128 128 True fwd_kvcache bf16 dao_ai False 0.87
llama3-70B 32 64 8 256 256 128 128 True fwd bf16 dao_ai False 14.12
llama3-70B 32 64 8 256 256 128 128 True bwd bf16 dao_ai False 1.75
llama3-70B 32 64 8 256 256 128 128 True fwd_varlen bf16 dao_ai False 10.07
llama3-70B 32 64 8 256 256 128 128 True bwd_varlen bf16 dao_ai False 1.53
llama3-70B 8 64 8 1024 1024 128 128 True fwd bf16 dao_ai False 33.39
llama3-70B 8 64 8 1024 1024 128 128 True bwd bf16 dao_ai False 2.52
llama3-70B 8 64 8 1024 1024 128 128 True fwd_varlen bf16 dao_ai False 25.36
llama3-70B 8 64 8 1024 1024 128 128 True bwd_varlen bf16 dao_ai False 2.39
llama3-70B 1 64 8 8192 8192 128 128 True fwd bf16 dao_ai False 49.26
llama3-70B 1 64 8 8192 8192 128 128 True bwd bf16 dao_ai False 2.32
llama3-70B 1 64 8 8192 8192 128 128 True fwd_varlen bf16 dao_ai False 40.94
llama3-70B 1 64 8 8192 8192 128 128 True bwd_varlen bf16 dao_ai False 2.24
llama3-70B 8 64 8 1024 4096 128 128 False fwd bf16 dao_ai False 52.25
llama3-70B 8 64 8 1024 4096 128 128 False bwd bf16 dao_ai False 2.93
llama3-70B 8 64 8 1024 4096 128 128 False fwd_varlen bf16 dao_ai False 44.76
llama3-70B 8 64 8 1024 4096 128 128 False bwd_varlen bf16 dao_ai False 2.88
llama3-70B 64 64 8 1 1024 128 128 True fwd_kvcache bf16 dao_ai False 0.89
llama3-70B 64 64 8 1 4096 128 128 True fwd_kvcache bf16 dao_ai False 0.92
llama3-70B 64 64 8 1 8192 128 128 True fwd_kvcache bf16 dao_ai False 0.90
llama3-405B 32 128 8 256 256 128 128 True fwd bf16 dao_ai False 14.89
llama3-405B 32 128 8 256 256 128 128 True bwd bf16 dao_ai False 1.71
llama3-405B 32 128 8 256 256 128 128 True fwd_varlen bf16 dao_ai False 10.58
llama3-405B 32 128 8 256 256 128 128 True bwd_varlen bf16 dao_ai False 1.48
llama3-405B 8 128 8 1024 1024 128 128 True fwd bf16 dao_ai False 34.05
llama3-405B 8 128 8 1024 1024 128 128 True bwd bf16 dao_ai False 2.32
llama3-405B 8 128 8 1024 1024 128 128 True fwd_varlen bf16 dao_ai False 25.78
llama3-405B 8 128 8 1024 1024 128 128 True bwd_varlen bf16 dao_ai False 2.21
llama3-405B 1 128 8 8192 8192 128 128 True fwd bf16 dao_ai False 46.43
llama3-405B 1 128 8 8192 8192 128 128 True bwd bf16 dao_ai False 1.91
llama3-405B 1 128 8 8192 8192 128 128 True fwd_varlen bf16 dao_ai False 39.33
llama3-405B 1 128 8 8192 8192 128 128 True bwd_varlen bf16 dao_ai False 1.79
llama3-405B 8 128 8 1024 4096 128 128 False fwd bf16 dao_ai False 48.73
llama3-405B 8 128 8 1024 4096 128 128 False bwd bf16 dao_ai False 2.75
llama3-405B 8 128 8 1024 4096 128 128 False fwd_varlen bf16 dao_ai False 43.12
llama3-405B 8 128 8 1024 4096 128 128 False bwd_varlen bf16 dao_ai False 2.72
llama3-405B 64 128 8 1 1024 128 128 True fwd_kvcache bf16 dao_ai False 0.89
llama3-405B 64 128 8 1 4096 128 128 True fwd_kvcache bf16 dao_ai False 0.93
llama3-405B 64 128 8 1 8192 128 128 True fwd_kvcache bf16 dao_ai False 0.86
mixtral-7B 32 32 8 256 256 128 128 True fwd bf16 dao_ai False 13.89
mixtral-7B 32 32 8 256 256 128 128 True bwd bf16 dao_ai False 1.68
mixtral-7B 32 32 8 256 256 128 128 True fwd_varlen bf16 dao_ai False 9.39
mixtral-7B 32 32 8 256 256 128 128 True bwd_varlen bf16 dao_ai False 1.53
mixtral-7B 8 32 8 1024 1024 128 128 True fwd bf16 dao_ai False 31.38
mixtral-7B 8 32 8 1024 1024 128 128 True bwd bf16 dao_ai False 2.52
mixtral-7B 8 32 8 1024 1024 128 128 True fwd_varlen bf16 dao_ai False 24.08
mixtral-7B 8 32 8 1024 1024 128 128 True bwd_varlen bf16 dao_ai False 2.39
mixtral-7B 1 32 8 8192 8192 128 128 True fwd bf16 dao_ai False 45.25
mixtral-7B 1 32 8 8192 8192 128 128 True bwd bf16 dao_ai False 2.69
mixtral-7B 1 32 8 8192 8192 128 128 True fwd_varlen bf16 dao_ai False 41.00
mixtral-7B 1 32 8 8192 8192 128 128 True bwd_varlen bf16 dao_ai False 2.62
mixtral-7B 8 32 8 1024 4096 128 128 False fwd bf16 dao_ai False 48.40
mixtral-7B 8 32 8 1024 4096 128 128 False bwd bf16 dao_ai False 3.15
mixtral-7B 8 32 8 1024 4096 128 128 False fwd_varlen bf16 dao_ai False 41.98
mixtral-7B 8 32 8 1024 4096 128 128 False bwd_varlen bf16 dao_ai False 3.07
mixtral-7B 64 32 8 1 1024 128 128 True fwd_kvcache bf16 dao_ai False 0.79
mixtral-7B 64 32 8 1 4096 128 128 True fwd_kvcache bf16 dao_ai False 0.80
mixtral-7B 64 32 8 1 8192 128 128 True fwd_kvcache bf16 dao_ai False 0.79
mixtral-22B 32 48 8 256 256 128 128 True fwd bf16 dao_ai False 13.77
mixtral-22B 32 48 8 256 256 128 128 True bwd bf16 dao_ai False 1.74
mixtral-22B 32 48 8 256 256 128 128 True fwd_varlen bf16 dao_ai False 9.60
mixtral-22B 32 48 8 256 256 128 128 True bwd_varlen bf16 dao_ai False 1.48
mixtral-22B 8 48 8 1024 1024 128 128 True fwd bf16 dao_ai False 32.21
mixtral-22B 8 48 8 1024 1024 128 128 True bwd bf16 dao_ai False 2.59
mixtral-22B 8 48 8 1024 1024 128 128 True fwd_varlen bf16 dao_ai False 25.10
mixtral-22B 8 48 8 1024 1024 128 128 True bwd_varlen bf16 dao_ai False 2.41
mixtral-22B 1 48 8 8192 8192 128 128 True fwd bf16 dao_ai False 49.74
mixtral-22B 1 48 8 8192 8192 128 128 True bwd bf16 dao_ai False 2.74
mixtral-22B 1 48 8 8192 8192 128 128 True fwd_varlen bf16 dao_ai False 42.09
mixtral-22B 1 48 8 8192 8192 128 128 True bwd_varlen bf16 dao_ai False 2.57
mixtral-22B 8 48 8 1024 4096 128 128 False fwd bf16 dao_ai False 52.71
mixtral-22B 8 48 8 1024 4096 128 128 False bwd bf16 dao_ai False 3.06
mixtral-22B 8 48 8 1024 4096 128 128 False fwd_varlen bf16 dao_ai False 45.46
mixtral-22B 8 48 8 1024 4096 128 128 False bwd_varlen bf16 dao_ai False 3.02
mixtral-22B 64 48 8 1 1024 128 128 True fwd_kvcache bf16 dao_ai False 0.89
mixtral-22B 64 48 8 1 4096 128 128 True fwd_kvcache bf16 dao_ai False 0.92
mixtral-22B 64 48 8 1 8192 128 128 True fwd_kvcache bf16 dao_ai False 0.92
deepseek-V3 32 128 128 256 256 56 56 True fwd bf16 dao_ai False 12.16
deepseek-V3 32 128 128 256 256 56 56 True bwd bf16 dao_ai False 1.95
deepseek-V3 32 128 128 256 256 56 56 True fwd_varlen bf16 dao_ai False 8.34
deepseek-V3 32 128 128 256 256 56 56 True bwd_varlen bf16 dao_ai False 1.76
deepseek-V3 8 128 128 1024 1024 56 56 True fwd bf16 dao_ai False 24.94
deepseek-V3 8 128 128 1024 1024 56 56 True bwd bf16 dao_ai False 2.79
deepseek-V3 8 128 128 1024 1024 56 56 True fwd_varlen bf16 dao_ai False 18.88
deepseek-V3 8 128 128 1024 1024 56 56 True bwd_varlen bf16 dao_ai False 2.71
deepseek-V3 1 128 128 8192 8192 56 56 True fwd bf16 dao_ai False 28.70
deepseek-V3 1 128 128 8192 8192 56 56 True bwd bf16 dao_ai False 2.69
deepseek-V3 1 128 128 8192 8192 56 56 True fwd_varlen bf16 dao_ai False 22.59
deepseek-V3 1 128 128 8192 8192 56 56 True bwd_varlen bf16 dao_ai False 2.67
deepseek-V3 8 128 128 1024 4096 56 56 False fwd bf16 dao_ai False 29.79
deepseek-V3 8 128 128 1024 4096 56 56 False bwd bf16 dao_ai False 3.16
deepseek-V3 8 128 128 1024 4096 56 56 False fwd_varlen bf16 dao_ai False 24.82
deepseek-V3 8 128 128 1024 4096 56 56 False bwd_varlen bf16 dao_ai False 3.13
deepseek-V3 64 128 128 1 1024 56 56 True fwd_kvcache bf16 dao_ai False 0.37
deepseek-V3 64 128 128 1 4096 56 56 True fwd_kvcache bf16 dao_ai False 0.37

