
[Triton] Bench mha dao_ai impl #2542

Merged

brunomazzottiamd merged 2 commits into main from micmelesse/bench_daoai on Apr 9, 2026

Conversation

@micmelesse
Contributor

@micmelesse micmelesse commented Mar 31, 2026

Motivation

This PR benchmarks the flash_attention_triton_amd code (the backend used in upstream flash attention) with bench_mha.py. It adds mha_set_impl("dao_ai") to dispatch forward/backward through the flash_attn_2 codepath and refactors bench_mha.py. Tests in test_mha.py now parametrize over both the default and dao_ai implementations.
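
A minimal, self-contained sketch of the kind of global impl switch described above; only mha_set_impl("dao_ai") and the routing to the flash_attn_2 codepath come from this PR, while the internal names and the use of PyTorch SDPA as a stand-in for both paths are illustrative assumptions:

```python
# Sketch only: a module-level switch that routes the MHA forward path.
# mha_set_impl("dao_ai") is the API named in this PR; everything else here
# (function names, the SDPA stand-ins) is assumed for illustration.
import torch
import torch.nn.functional as F

_MHA_IMPL = "default"  # module-level state consulted at dispatch time


def mha_set_impl(impl: str) -> None:
    """Select which MHA forward implementation subsequent calls use."""
    global _MHA_IMPL
    if impl not in ("default", "dao_ai"):
        raise ValueError(f"unknown impl: {impl}")
    _MHA_IMPL = impl


def flash_attn_forward(q, k, v, causal=False):
    if _MHA_IMPL == "dao_ai":
        # The real code would route through the flash_attn_2 codepath in
        # flash_attention_triton_amd; SDPA is only a placeholder here.
        return F.scaled_dot_product_attention(q, k, v, is_causal=causal)
    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)


if __name__ == "__main__":
    q = k = v = torch.randn(2, 8, 128, 64)
    mha_set_impl("dao_ai")
    print(flash_attn_forward(q, k, v, causal=True).shape)
```

With a switch like this, test_mha.py can parametrize over ("default", "dao_ai") and call the setter once per case.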

Technical Details

Test Plan

Test Result

Submission Checklist

@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

  • ci:triton-355: Run Triton tests on MI355 in addition to MI325
  • ci:sglang: SGLang integration tests
  • ci:atom: ATOM benchmark (DeepSeek-R1 + GPT-OSS)
  • ci:vllm: vLLM benchmark
  • ci:all: All of the above

Add labels via the sidebar or gh pr edit 2542 --add-label <label>

@micmelesse micmelesse force-pushed the micmelesse/bench_daoai branch 5 times, most recently from 112f149 to 7a6f68f Compare March 31, 2026 02:55
@micmelesse micmelesse marked this pull request as ready for review March 31, 2026 02:56
@micmelesse micmelesse requested review from a team and Copilot March 31, 2026 02:56
Contributor

Copilot AI left a comment

Pull request overview

Adds a selectable “dao_ai” MHA forward implementation (backed by flash_attn_triton_amd / flash_attn_2) and wires it into the existing Triton MHA benchmark script and the flash-attention integration workflow so CI can produce benchmark artifacts.

Changes:

  • Add -impl {default,dao_ai} to bench_mha.py, expand benchmark config grids (including GQA + causal axis), and annotate provider labels with the impl (a rough sketch follows this list).
  • Introduce a global MHA forward-impl switch in aiter/ops/triton/attention/mha.py that routes forward to flash_attn_2 when selected.
  • Extend .github/workflows/flash_attention_integration.yaml to run and upload dao_ai kernel + model benchmarks (and rename existing benchmark log artifacts).
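
A rough sketch of the -impl wiring mentioned in the first item above; the flag name and choices come from the review summary, while the config grid, helper names, and provider labelling below are assumptions about what bench_mha.py might look like, not its actual contents:

```python
# Illustrative only: wiring an -impl {default,dao_ai} choice into a benchmark
# grid and tagging provider labels with it so result columns stay distinct.
import argparse
import itertools


def parse_args():
    p = argparse.ArgumentParser(description="MHA benchmark (sketch)")
    p.add_argument("-impl", choices=["default", "dao_ai"], default="default",
                   help="which MHA forward implementation to benchmark")
    return p.parse_args()


def build_configs(impl):
    # Expand the grid over GQA head ratios, sequence lengths, and a causal axis.
    heads = [(32, 8), (64, 8), (128, 128)]   # (HQ, HK) pairs, made up here
    seqlens = [256, 1024, 8192]
    for (hq, hk), n_ctx, causal in itertools.product(heads, seqlens, (True, False)):
        yield {"HQ": hq, "HK": hk, "N_CTX": n_ctx, "causal": causal,
               "provider": f"triton-{impl}"}


if __name__ == "__main__":
    args = parse_args()
    for cfg in build_configs(args.impl):
        print(cfg)
```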

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

  • op_tests/op_benchmarks/triton/bench_mha.py: Adds -impl flag, includes causal as a benchmark dimension, expands head/seq config coverage, and updates provider labeling.
  • aiter/ops/triton/attention/mha.py: Adds global impl selection and routes forward to flash_attn_2 for dao_ai.
  • .github/workflows/flash_attention_integration.yaml: Runs dao_ai benchmarks in CI and uploads logs/CSVs as artifacts.


Comment thread aiter/ops/triton/attention/mha.py
Comment thread aiter/ops/triton/attention/mha.py Outdated
Comment thread aiter/ops/triton/attention/mha.py
Comment thread .github/workflows/flash_attention_integration.yaml Outdated
Comment thread op_tests/op_benchmarks/triton/bench_mha.py Outdated
brunomazzottiamd

This comment was marked as resolved.

@micmelesse micmelesse force-pushed the micmelesse/bench_daoai branch 2 times, most recently from 56a638e to 1d85ccf Compare March 31, 2026 15:35
@micmelesse micmelesse marked this pull request as draft March 31, 2026 17:19
@micmelesse

This comment was marked as resolved.

@micmelesse micmelesse force-pushed the micmelesse/bench_daoai branch 2 times, most recently from 055223f to 07d864a Compare April 1, 2026 15:39
@micmelesse micmelesse marked this pull request as ready for review April 1, 2026 15:39
@micmelesse micmelesse force-pushed the micmelesse/bench_daoai branch 10 times, most recently from 9ae8dd5 to ae5999b Compare April 7, 2026 14:47
@micmelesse micmelesse force-pushed the micmelesse/bench_daoai branch 4 times, most recently from 9c6caf4 to 48a086e Compare April 7, 2026 23:45
brunomazzottiamd

This comment was marked as resolved.

add impl bench

simple config list

edge cases

lint

more configs

separate scenario

add batch config

split branches

print loop

limit configs

add literal

scenarios

workload function

try catch

better print

csv writer

add skip

save

save

try free

arch check

again

save

save

try again

save

fp8 skip

save

save

lint
@micmelesse micmelesse force-pushed the micmelesse/bench_daoai branch from 48a086e to 6b4b934 Compare April 8, 2026 18:44
@micmelesse micmelesse force-pushed the micmelesse/bench_daoai branch from 6b4b934 to 15b22ff Compare April 8, 2026 18:56
Contributor

@brunomazzottiamd brunomazzottiamd left a comment

LGTM! Let's wait for the CI and then merge it.

@brunomazzottiamd
Contributor

The following failures on Triton Tests (MI35X) / Shard 0 are expected:

=========================== short test summary info ============================
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-32-512-512-1]
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-32-512-512-4]
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-32-1024-1024-1]
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-32-1024-1024-4]
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-32-2048-2048-1]
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-32-2048-2048-4]
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-64-512-512-1]
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-64-512-512-4]
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-64-1024-1024-1]
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-64-1024-1024-4]
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-64-2048-2048-1]
FAILED op_tests/triton_tests/attention/test_mha.py::test_mha_backward[True-False-0.0-True-128-8-64-2048-2048-4]
==== 12 failed, 5269 passed, 2016 skipped, 6 warnings in 4842.04s (1:20:42) ====

Everything else has passed; we're good to merge.
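
For local triage, any one of the listed parametrizations can be rerun in isolation by its node ID (assuming the repo and its test dependencies are set up on a suitable GPU; the bracketed ID below is copied verbatim from the summary above):

```python
# Re-run a single failing parametrization from the summary above.
import pytest

node_id = (
    "op_tests/triton_tests/attention/test_mha.py::"
    "test_mha_backward[True-False-0.0-True-128-8-32-512-512-1]"
)
raise SystemExit(pytest.main(["-q", node_id]))
```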

@brunomazzottiamd brunomazzottiamd merged commit 6c37a55 into main Apr 9, 2026
105 of 114 checks passed
@brunomazzottiamd brunomazzottiamd deleted the micmelesse/bench_daoai branch April 9, 2026 12:37
@micmelesse
Contributor Author

micmelesse commented Apr 9, 2026

The bench results on MI35X and RDNA3 are at https://github.com/ROCm/aiter/actions/runs/24152934159/job/70485243064. I have posted the data below.
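
For reading the TFLOPS columns below, the conventional attention FLOP model is a useful reference point; this is an assumption about the benchmark's accounting (bench_mha.py may count FLOPs differently), not something stated in this PR:

```python
# Rough FLOP model often used to convert attention latency into TFLOPS:
# two matmuls of shape (N_CTX_Q x D_HEAD x N_CTX_K), 2 FLOPs per MAC, halved
# under a causal mask, with backward commonly booked at ~2.5x the forward.
# Assumed for illustration; not necessarily what bench_mha.py computes.
def attn_flops(batch, hq, n_ctx_q, n_ctx_k, d_head, causal, mode="fwd"):
    flops = 4.0 * batch * hq * n_ctx_q * n_ctx_k * d_head
    if causal:
        flops *= 0.5
    if mode.startswith("bwd"):
        flops *= 2.5
    return flops


# Example row shape: llama3-8B, BATCH=8, HQ=32, 1024x1024, D_HEAD=128, causal fwd
print(attn_flops(8, 32, 1024, 1024, 128, causal=True) / 1e12, "TFLOP per pass")
```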

Here are the MI35X numbers

model BATCH HQ HK N_CTX_Q N_CTX_K D_HEAD D_HEAD_V causal function dtype impl fused TFLOPS
llama3-8B 32 32 8 256 256 128 128 True fwd bf16 dao_ai False 104.37
llama3-8B 32 32 8 256 256 128 128 True bwd bf16 dao_ai False 101.09
llama3-8B 32 32 8 256 256 128 128 True fwd_varlen bf16 dao_ai False 52.90
llama3-8B 32 32 8 256 256 128 128 True bwd_varlen bf16 dao_ai False 59.47
llama3-8B 8 32 8 1024 1024 128 128 True fwd bf16 dao_ai False 336.49
llama3-8B 8 32 8 1024 1024 128 128 True bwd bf16 dao_ai False 192.25
llama3-8B 8 32 8 1024 1024 128 128 True fwd_varlen bf16 dao_ai False 167.41
llama3-8B 8 32 8 1024 1024 128 128 True bwd_varlen bf16 dao_ai False 136.63
llama3-8B 1 32 8 8192 8192 128 128 True fwd bf16 dao_ai False 568.97
llama3-8B 1 32 8 8192 8192 128 128 True bwd bf16 dao_ai False 328.52
llama3-8B 1 32 8 8192 8192 128 128 True fwd_varlen bf16 dao_ai False 467.91
llama3-8B 1 32 8 8192 8192 128 128 True bwd_varlen bf16 dao_ai False 275.01
llama3-8B 8 32 8 1024 4096 128 128 False fwd bf16 dao_ai False 653.23
llama3-8B 8 32 8 1024 4096 128 128 False bwd bf16 dao_ai False 329.63
llama3-8B 8 32 8 1024 4096 128 128 False fwd_varlen bf16 dao_ai False 471.38
llama3-8B 8 32 8 1024 4096 128 128 False bwd_varlen bf16 dao_ai False 272.11
llama3-8B 64 32 8 1 1024 128 128 True fwd_kvcache bf16 dao_ai False 3.72
llama3-8B 64 32 8 1 4096 128 128 True fwd_kvcache bf16 dao_ai False 4.03
llama3-8B 64 32 8 1 8192 128 128 True fwd_kvcache bf16 dao_ai False 4.10
llama3-70B 32 64 8 256 256 128 128 True fwd bf16 dao_ai False 183.38
llama3-70B 32 64 8 256 256 128 128 True bwd bf16 dao_ai False 105.30
llama3-70B 32 64 8 256 256 128 128 True fwd_varlen bf16 dao_ai False 82.60
llama3-70B 32 64 8 256 256 128 128 True bwd_varlen bf16 dao_ai False 77.13
llama3-70B 8 64 8 1024 1024 128 128 True fwd bf16 dao_ai False 399.05
llama3-70B 8 64 8 1024 1024 128 128 True bwd bf16 dao_ai False 197.68
llama3-70B 8 64 8 1024 1024 128 128 True fwd_varlen bf16 dao_ai False 239.43
llama3-70B 8 64 8 1024 1024 128 128 True bwd_varlen bf16 dao_ai False 153.52
llama3-70B 1 64 8 8192 8192 128 128 True fwd bf16 dao_ai False 617.68
llama3-70B 1 64 8 8192 8192 128 128 True bwd bf16 dao_ai False 322.57
llama3-70B 1 64 8 8192 8192 128 128 True fwd_varlen bf16 dao_ai False 538.18
llama3-70B 1 64 8 8192 8192 128 128 True bwd_varlen bf16 dao_ai False 281.01
llama3-70B 8 64 8 1024 4096 128 128 False fwd bf16 dao_ai False 694.00
llama3-70B 8 64 8 1024 4096 128 128 False bwd bf16 dao_ai False 334.41
llama3-70B 8 64 8 1024 4096 128 128 False fwd_varlen bf16 dao_ai False 538.90
llama3-70B 8 64 8 1024 4096 128 128 False bwd_varlen bf16 dao_ai False 281.06
llama3-70B 64 64 8 1 1024 128 128 True fwd_kvcache bf16 dao_ai False 4.00
llama3-70B 64 64 8 1 4096 128 128 True fwd_kvcache bf16 dao_ai False 4.28
llama3-70B 64 64 8 1 8192 128 128 True fwd_kvcache bf16 dao_ai False 4.42
llama3-405B 32 128 8 256 256 128 128 True fwd bf16 dao_ai False 191.76
llama3-405B 32 128 8 256 256 128 128 True bwd bf16 dao_ai False 107.42
llama3-405B 32 128 8 256 256 128 128 True fwd_varlen bf16 dao_ai False 110.54
llama3-405B 32 128 8 256 256 128 128 True bwd_varlen bf16 dao_ai False 85.40
llama3-405B 8 128 8 1024 1024 128 128 True fwd bf16 dao_ai False 419.41
llama3-405B 8 128 8 1024 1024 128 128 True bwd bf16 dao_ai False 202.22
llama3-405B 8 128 8 1024 1024 128 128 True fwd_varlen bf16 dao_ai False 303.46
llama3-405B 8 128 8 1024 1024 128 128 True bwd_varlen bf16 dao_ai False 165.99
llama3-405B 1 128 8 8192 8192 128 128 True fwd bf16 dao_ai False 632.13
llama3-405B 1 128 8 8192 8192 128 128 True bwd bf16 dao_ai False 319.43
llama3-405B 1 128 8 8192 8192 128 128 True fwd_varlen bf16 dao_ai False 585.47
llama3-405B 1 128 8 8192 8192 128 128 True bwd_varlen bf16 dao_ai False 281.29
llama3-405B 8 128 8 1024 4096 128 128 False fwd bf16 dao_ai False 686.85
llama3-405B 8 128 8 1024 4096 128 128 False bwd bf16 dao_ai False 328.35
llama3-405B 8 128 8 1024 4096 128 128 False fwd_varlen bf16 dao_ai False 581.36
llama3-405B 8 128 8 1024 4096 128 128 False bwd_varlen bf16 dao_ai False 283.53
llama3-405B 64 128 8 1 1024 128 128 True fwd_kvcache bf16 dao_ai False 4.33
llama3-405B 64 128 8 1 4096 128 128 True fwd_kvcache bf16 dao_ai False 4.70
llama3-405B 64 128 8 1 8192 128 128 True fwd_kvcache bf16 dao_ai False 4.72
mixtral-7B 32 32 8 256 256 128 128 True fwd bf16 dao_ai False 101.64
mixtral-7B 32 32 8 256 256 128 128 True bwd bf16 dao_ai False 102.20
mixtral-7B 32 32 8 256 256 128 128 True fwd_varlen bf16 dao_ai False 51.54
mixtral-7B 32 32 8 256 256 128 128 True bwd_varlen bf16 dao_ai False 58.67
mixtral-7B 8 32 8 1024 1024 128 128 True fwd bf16 dao_ai False 336.52
mixtral-7B 8 32 8 1024 1024 128 128 True bwd bf16 dao_ai False 191.87
mixtral-7B 8 32 8 1024 1024 128 128 True fwd_varlen bf16 dao_ai False 164.68
mixtral-7B 8 32 8 1024 1024 128 128 True bwd_varlen bf16 dao_ai False 136.21
mixtral-7B 1 32 8 8192 8192 128 128 True fwd bf16 dao_ai False 568.21
mixtral-7B 1 32 8 8192 8192 128 128 True bwd bf16 dao_ai False 330.51
mixtral-7B 1 32 8 8192 8192 128 128 True fwd_varlen bf16 dao_ai False 465.56
mixtral-7B 1 32 8 8192 8192 128 128 True bwd_varlen bf16 dao_ai False 273.98
mixtral-7B 8 32 8 1024 4096 128 128 False fwd bf16 dao_ai False 652.98
mixtral-7B 8 32 8 1024 4096 128 128 False bwd bf16 dao_ai False 327.49
mixtral-7B 8 32 8 1024 4096 128 128 False fwd_varlen bf16 dao_ai False 470.39
mixtral-7B 8 32 8 1024 4096 128 128 False bwd_varlen bf16 dao_ai False 272.74
mixtral-7B 64 32 8 1 1024 128 128 True fwd_kvcache bf16 dao_ai False 3.72
mixtral-7B 64 32 8 1 4096 128 128 True fwd_kvcache bf16 dao_ai False 4.01
mixtral-7B 64 32 8 1 8192 128 128 True fwd_kvcache bf16 dao_ai False 4.10
mixtral-22B 32 48 8 256 256 128 128 True fwd bf16 dao_ai False 147.98
mixtral-22B 32 48 8 256 256 128 128 True bwd bf16 dao_ai False 102.92
mixtral-22B 32 48 8 256 256 128 128 True fwd_varlen bf16 dao_ai False 69.39
mixtral-22B 32 48 8 256 256 128 128 True bwd_varlen bf16 dao_ai False 69.28
mixtral-22B 8 48 8 1024 1024 128 128 True fwd bf16 dao_ai False 377.39
mixtral-22B 8 48 8 1024 1024 128 128 True bwd bf16 dao_ai False 190.86
mixtral-22B 8 48 8 1024 1024 128 128 True fwd_varlen bf16 dao_ai False 212.57
mixtral-22B 8 48 8 1024 1024 128 128 True bwd_varlen bf16 dao_ai False 145.01
mixtral-22B 1 48 8 8192 8192 128 128 True fwd bf16 dao_ai False 593.57
mixtral-22B 1 48 8 8192 8192 128 128 True bwd bf16 dao_ai False 327.92
mixtral-22B 1 48 8 8192 8192 128 128 True fwd_varlen bf16 dao_ai False 504.81
mixtral-22B 1 48 8 8192 8192 128 128 True bwd_varlen bf16 dao_ai False 276.84
mixtral-22B 8 48 8 1024 4096 128 128 False fwd bf16 dao_ai False 669.42
mixtral-22B 8 48 8 1024 4096 128 128 False bwd bf16 dao_ai False 329.64
mixtral-22B 8 48 8 1024 4096 128 128 False fwd_varlen bf16 dao_ai False 523.02
mixtral-22B 8 48 8 1024 4096 128 128 False bwd_varlen bf16 dao_ai False 276.83
mixtral-22B 64 48 8 1 1024 128 128 True fwd_kvcache bf16 dao_ai False 3.99
mixtral-22B 64 48 8 1 4096 128 128 True fwd_kvcache bf16 dao_ai False 4.28
mixtral-22B 64 48 8 1 8192 128 128 True fwd_kvcache bf16 dao_ai False 4.47
deepseek-V3 32 128 128 256 256 56 56 True fwd bf16 dao_ai False 96.47
deepseek-V3 32 128 128 256 256 56 56 True bwd bf16 dao_ai False 42.08
deepseek-V3 32 128 128 256 256 56 56 True fwd_varlen bf16 dao_ai False 53.76
deepseek-V3 32 128 128 256 256 56 56 True bwd_varlen bf16 dao_ai False 33.61
deepseek-V3 8 128 128 1024 1024 56 56 True fwd bf16 dao_ai False 197.39
deepseek-V3 8 128 128 1024 1024 56 56 True bwd bf16 dao_ai False 103.60
deepseek-V3 8 128 128 1024 1024 56 56 True fwd_varlen bf16 dao_ai False 142.15
deepseek-V3 8 128 128 1024 1024 56 56 True bwd_varlen bf16 dao_ai False 87.53
deepseek-V3 1 128 128 8192 8192 56 56 True fwd bf16 dao_ai False 302.74
deepseek-V3 1 128 128 8192 8192 56 56 True bwd bf16 dao_ai False 174.83
deepseek-V3 1 128 128 8192 8192 56 56 True fwd_varlen bf16 dao_ai False 264.71
deepseek-V3 1 128 128 8192 8192 56 56 True bwd_varlen bf16 dao_ai False 150.80
deepseek-V3 8 128 128 1024 4096 56 56 False fwd bf16 dao_ai False 335.64
deepseek-V3 8 128 128 1024 4096 56 56 False bwd bf16 dao_ai False 223.76
deepseek-V3 8 128 128 1024 4096 56 56 False fwd_varlen bf16 dao_ai False 292.88
deepseek-V3 8 128 128 1024 4096 56 56 False bwd_varlen bf16 dao_ai False 207.64
deepseek-V3 64 128 128 1 1024 56 56 True fwd_kvcache bf16 dao_ai False 1.66
deepseek-V3 64 128 128 1 4096 56 56 True fwd_kvcache bf16 dao_ai False 1.79
deepseek-V3 64 128 128 1 8192 56 56 True fwd_kvcache bf16 dao_ai False 1.81

Here are the RDNA3 numbers

model BATCH HQ HK N_CTX_Q N_CTX_K D_HEAD D_HEAD_V causal function dtype impl fused TFLOPS
llama3-8B 32 32 8 256 256 128 128 True fwd bf16 dao_ai False 14.75
llama3-8B 32 32 8 256 256 128 128 True bwd bf16 dao_ai False 1.74
llama3-8B 32 32 8 256 256 128 128 True fwd_varlen bf16 dao_ai False 9.65
llama3-8B 32 32 8 256 256 128 128 True bwd_varlen bf16 dao_ai False 1.51
llama3-8B 8 32 8 1024 1024 128 128 True fwd bf16 dao_ai False 33.82
llama3-8B 8 32 8 1024 1024 128 128 True bwd bf16 dao_ai False 2.50
llama3-8B 8 32 8 1024 1024 128 128 True fwd_varlen bf16 dao_ai False 24.46
llama3-8B 8 32 8 1024 1024 128 128 True bwd_varlen bf16 dao_ai False 2.38
llama3-8B 1 32 8 8192 8192 128 128 True fwd bf16 dao_ai False 49.09
llama3-8B 1 32 8 8192 8192 128 128 True bwd bf16 dao_ai False 2.67
llama3-8B 1 32 8 8192 8192 128 128 True fwd_varlen bf16 dao_ai False 42.25
llama3-8B 1 32 8 8192 8192 128 128 True bwd_varlen bf16 dao_ai False 2.58
llama3-8B 8 32 8 1024 4096 128 128 False fwd bf16 dao_ai False 51.93
llama3-8B 8 32 8 1024 4096 128 128 False bwd bf16 dao_ai False 3.16
llama3-8B 8 32 8 1024 4096 128 128 False fwd_varlen bf16 dao_ai False 45.11
llama3-8B 8 32 8 1024 4096 128 128 False bwd_varlen bf16 dao_ai False 3.11
llama3-8B 64 32 8 1 1024 128 128 True fwd_kvcache bf16 dao_ai False 0.86
llama3-8B 64 32 8 1 4096 128 128 True fwd_kvcache bf16 dao_ai False 0.89
llama3-8B 64 32 8 1 8192 128 128 True fwd_kvcache bf16 dao_ai False 0.87
llama3-70B 32 64 8 256 256 128 128 True fwd bf16 dao_ai False 14.12
llama3-70B 32 64 8 256 256 128 128 True bwd bf16 dao_ai False 1.75
llama3-70B 32 64 8 256 256 128 128 True fwd_varlen bf16 dao_ai False 10.07
llama3-70B 32 64 8 256 256 128 128 True bwd_varlen bf16 dao_ai False 1.53
llama3-70B 8 64 8 1024 1024 128 128 True fwd bf16 dao_ai False 33.39
llama3-70B 8 64 8 1024 1024 128 128 True bwd bf16 dao_ai False 2.52
llama3-70B 8 64 8 1024 1024 128 128 True fwd_varlen bf16 dao_ai False 25.36
llama3-70B 8 64 8 1024 1024 128 128 True bwd_varlen bf16 dao_ai False 2.39
llama3-70B 1 64 8 8192 8192 128 128 True fwd bf16 dao_ai False 49.26
llama3-70B 1 64 8 8192 8192 128 128 True bwd bf16 dao_ai False 2.32
llama3-70B 1 64 8 8192 8192 128 128 True fwd_varlen bf16 dao_ai False 40.94
llama3-70B 1 64 8 8192 8192 128 128 True bwd_varlen bf16 dao_ai False 2.24
llama3-70B 8 64 8 1024 4096 128 128 False fwd bf16 dao_ai False 52.25
llama3-70B 8 64 8 1024 4096 128 128 False bwd bf16 dao_ai False 2.93
llama3-70B 8 64 8 1024 4096 128 128 False fwd_varlen bf16 dao_ai False 44.76
llama3-70B 8 64 8 1024 4096 128 128 False bwd_varlen bf16 dao_ai False 2.88
llama3-70B 64 64 8 1 1024 128 128 True fwd_kvcache bf16 dao_ai False 0.89
llama3-70B 64 64 8 1 4096 128 128 True fwd_kvcache bf16 dao_ai False 0.92
llama3-70B 64 64 8 1 8192 128 128 True fwd_kvcache bf16 dao_ai False 0.90
llama3-405B 32 128 8 256 256 128 128 True fwd bf16 dao_ai False 14.89
llama3-405B 32 128 8 256 256 128 128 True bwd bf16 dao_ai False 1.71
llama3-405B 32 128 8 256 256 128 128 True fwd_varlen bf16 dao_ai False 10.58
llama3-405B 32 128 8 256 256 128 128 True bwd_varlen bf16 dao_ai False 1.48
llama3-405B 8 128 8 1024 1024 128 128 True fwd bf16 dao_ai False 34.05
llama3-405B 8 128 8 1024 1024 128 128 True bwd bf16 dao_ai False 2.32
llama3-405B 8 128 8 1024 1024 128 128 True fwd_varlen bf16 dao_ai False 25.78
llama3-405B 8 128 8 1024 1024 128 128 True bwd_varlen bf16 dao_ai False 2.21
llama3-405B 1 128 8 8192 8192 128 128 True fwd bf16 dao_ai False 46.43
llama3-405B 1 128 8 8192 8192 128 128 True bwd bf16 dao_ai False 1.91
llama3-405B 1 128 8 8192 8192 128 128 True fwd_varlen bf16 dao_ai False 39.33
llama3-405B 1 128 8 8192 8192 128 128 True bwd_varlen bf16 dao_ai False 1.79
llama3-405B 8 128 8 1024 4096 128 128 False fwd bf16 dao_ai False 48.73
llama3-405B 8 128 8 1024 4096 128 128 False bwd bf16 dao_ai False 2.75
llama3-405B 8 128 8 1024 4096 128 128 False fwd_varlen bf16 dao_ai False 43.12
llama3-405B 8 128 8 1024 4096 128 128 False bwd_varlen bf16 dao_ai False 2.72
llama3-405B 64 128 8 1 1024 128 128 True fwd_kvcache bf16 dao_ai False 0.89
llama3-405B 64 128 8 1 4096 128 128 True fwd_kvcache bf16 dao_ai False 0.93
llama3-405B 64 128 8 1 8192 128 128 True fwd_kvcache bf16 dao_ai False 0.86
mixtral-7B 32 32 8 256 256 128 128 True fwd bf16 dao_ai False 13.89
mixtral-7B 32 32 8 256 256 128 128 True bwd bf16 dao_ai False 1.68
mixtral-7B 32 32 8 256 256 128 128 True fwd_varlen bf16 dao_ai False 9.39
mixtral-7B 32 32 8 256 256 128 128 True bwd_varlen bf16 dao_ai False 1.53
mixtral-7B 8 32 8 1024 1024 128 128 True fwd bf16 dao_ai False 31.38
mixtral-7B 8 32 8 1024 1024 128 128 True bwd bf16 dao_ai False 2.52
mixtral-7B 8 32 8 1024 1024 128 128 True fwd_varlen bf16 dao_ai False 24.08
mixtral-7B 8 32 8 1024 1024 128 128 True bwd_varlen bf16 dao_ai False 2.39
mixtral-7B 1 32 8 8192 8192 128 128 True fwd bf16 dao_ai False 45.25
mixtral-7B 1 32 8 8192 8192 128 128 True bwd bf16 dao_ai False 2.69
mixtral-7B 1 32 8 8192 8192 128 128 True fwd_varlen bf16 dao_ai False 41.00
mixtral-7B 1 32 8 8192 8192 128 128 True bwd_varlen bf16 dao_ai False 2.62
mixtral-7B 8 32 8 1024 4096 128 128 False fwd bf16 dao_ai False 48.40
mixtral-7B 8 32 8 1024 4096 128 128 False bwd bf16 dao_ai False 3.15
mixtral-7B 8 32 8 1024 4096 128 128 False fwd_varlen bf16 dao_ai False 41.98
mixtral-7B 8 32 8 1024 4096 128 128 False bwd_varlen bf16 dao_ai False 3.07
mixtral-7B 64 32 8 1 1024 128 128 True fwd_kvcache bf16 dao_ai False 0.79
mixtral-7B 64 32 8 1 4096 128 128 True fwd_kvcache bf16 dao_ai False 0.80
mixtral-7B 64 32 8 1 8192 128 128 True fwd_kvcache bf16 dao_ai False 0.79
mixtral-22B 32 48 8 256 256 128 128 True fwd bf16 dao_ai False 13.77
mixtral-22B 32 48 8 256 256 128 128 True bwd bf16 dao_ai False 1.74
mixtral-22B 32 48 8 256 256 128 128 True fwd_varlen bf16 dao_ai False 9.60
mixtral-22B 32 48 8 256 256 128 128 True bwd_varlen bf16 dao_ai False 1.48
mixtral-22B 8 48 8 1024 1024 128 128 True fwd bf16 dao_ai False 32.21
mixtral-22B 8 48 8 1024 1024 128 128 True bwd bf16 dao_ai False 2.59
mixtral-22B 8 48 8 1024 1024 128 128 True fwd_varlen bf16 dao_ai False 25.10
mixtral-22B 8 48 8 1024 1024 128 128 True bwd_varlen bf16 dao_ai False 2.41
mixtral-22B 1 48 8 8192 8192 128 128 True fwd bf16 dao_ai False 49.74
mixtral-22B 1 48 8 8192 8192 128 128 True bwd bf16 dao_ai False 2.74
mixtral-22B 1 48 8 8192 8192 128 128 True fwd_varlen bf16 dao_ai False 42.09
mixtral-22B 1 48 8 8192 8192 128 128 True bwd_varlen bf16 dao_ai False 2.57
mixtral-22B 8 48 8 1024 4096 128 128 False fwd bf16 dao_ai False 52.71
mixtral-22B 8 48 8 1024 4096 128 128 False bwd bf16 dao_ai False 3.06
mixtral-22B 8 48 8 1024 4096 128 128 False fwd_varlen bf16 dao_ai False 45.46
mixtral-22B 8 48 8 1024 4096 128 128 False bwd_varlen bf16 dao_ai False 3.02
mixtral-22B 64 48 8 1 1024 128 128 True fwd_kvcache bf16 dao_ai False 0.89
mixtral-22B 64 48 8 1 4096 128 128 True fwd_kvcache bf16 dao_ai False 0.92
mixtral-22B 64 48 8 1 8192 128 128 True fwd_kvcache bf16 dao_ai False 0.92
deepseek-V3 32 128 128 256 256 56 56 True fwd bf16 dao_ai False 12.16
deepseek-V3 32 128 128 256 256 56 56 True bwd bf16 dao_ai False 1.95
deepseek-V3 32 128 128 256 256 56 56 True fwd_varlen bf16 dao_ai False 8.34
deepseek-V3 32 128 128 256 256 56 56 True bwd_varlen bf16 dao_ai False 1.76
deepseek-V3 8 128 128 1024 1024 56 56 True fwd bf16 dao_ai False 24.94
deepseek-V3 8 128 128 1024 1024 56 56 True bwd bf16 dao_ai False 2.79
deepseek-V3 8 128 128 1024 1024 56 56 True fwd_varlen bf16 dao_ai False 18.88
deepseek-V3 8 128 128 1024 1024 56 56 True bwd_varlen bf16 dao_ai False 2.71
deepseek-V3 1 128 128 8192 8192 56 56 True fwd bf16 dao_ai False 28.70
deepseek-V3 1 128 128 8192 8192 56 56 True bwd bf16 dao_ai False 2.69
deepseek-V3 1 128 128 8192 8192 56 56 True fwd_varlen bf16 dao_ai False 22.59
deepseek-V3 1 128 128 8192 8192 56 56 True bwd_varlen bf16 dao_ai False 2.67
deepseek-V3 8 128 128 1024 4096 56 56 False fwd bf16 dao_ai False 29.79
deepseek-V3 8 128 128 1024 4096 56 56 False bwd bf16 dao_ai False 3.16
deepseek-V3 8 128 128 1024 4096 56 56 False fwd_varlen bf16 dao_ai False 24.82
deepseek-V3 8 128 128 1024 4096 56 56 False bwd_varlen bf16 dao_ai False 3.13
deepseek-V3 64 128 128 1 1024 56 56 True fwd_kvcache bf16 dao_ai False 0.37
deepseek-V3 64 128 128 1 4096 56 56 True fwd_kvcache bf16 dao_ai False 0.37

