
Add Metal attention benchmark tool#178

Merged
WindChimeRan merged 2 commits into vllm-project:main from Kingwl:feat/add-benchmark-script
Mar 20, 2026

Conversation

@Kingwl
Contributor

@Kingwl Kingwl commented Mar 18, 2026

Summary

This PR adds a local Metal attention benchmark under tools/benchmark/attention_benchmark.py.

It supports both preset workloads and fully manual one-off workloads. Presets are defined directly in Python as built-in CASES and GROUPS, with built-in groups such as all, decode, varlen, small, typical, and long.

The benchmark supports num_layers for multi-layer benchmarking and reports per-layer average latency, following the upstream design.

By default, preset runs compare v1, v2, textbook, and sdpa. Passing --backend all also includes sdpa-compute-only as an additional baseline.

The benchmark prints a text table to stdout and also supports --output-json and --output-csv exports.

It also factors shared benchmark/reference helpers into tools/attention_bench_utils.py and adds usage notes to tools/README.md.
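
For context, a minimal sketch of how a per-layer average latency like the one reported below can be computed (the helper name and the trivial stand-in workload here are hypothetical, not the tool's actual code):

```python
import time

def time_per_layer(run_layer, num_layers=1, warmup=10, iters=100):
    """Hypothetical sketch: time `iters` multi-layer passes and report the
    average latency per layer, matching how the summary is described."""
    for _ in range(warmup):
        for _ in range(num_layers):
            run_layer()
    start = time.perf_counter()
    for _ in range(iters):
        for _ in range(num_layers):
            run_layer()
    elapsed = time.perf_counter() - start
    return elapsed / iters / num_layers * 1e3  # ms per layer

# usage with a trivial CPU stand-in for the attention kernel
latency_ms = time_per_layer(lambda: sum(range(1000)), num_layers=4, warmup=2, iters=5)
```

The real tool additionally has to insert GPU barriers around the timed region; see the review discussion below.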

Results may vary slightly run to run.

Single Layer Metal Attention Benchmark

cases: decode-small, decode-typical, decode-big-head, decode-long, varlen-light, varlen-typical, varlen-single-long, varlen-ragged-longtail
num_layers: 1  block_size: 16  dtype: float16  warmup: 10  iters: 100  seed: 0

case            | type   | batch | shape                            | v1    | v1_vs_best  | v2    | v2_vs_best  | textbook | textbook_vs_best | sdpa  | sdpa_vs_best
----------------+--------+-------+----------------------------------+-------+-------------+-------+-------------+----------+------------------+-------+-------------
small           | decode | 1     | B=1, q=1, kv=128                 | 0.319 | 111.8%      | 0.286 | 100.1%      | 0.286    | 100.0% best      | 0.337 | 118.0%
typical         | decode | 8     | B=8, q=1, kv=2048                | 0.328 | 108.0%      | 0.304 | 100.0% best | 0.957    | 314.6%           | 0.881 | 289.6%
big-head        | decode | 8     | B=8, q=1, kv=2048                | 0.477 | 100.0% best | 0.514 | 107.7%      | 3.900    | 817.7%           | 1.244 | 260.7%
long            | decode | 32    | B=32, q=1, kv=8192               | 1.121 | 107.8%      | 1.041 | 100.0% best | 6.615    | 635.6%           | 5.112 | 491.2%
light           | varlen | 4     | 1/128 4/256 16/512 64/1024       | N/A   | -           | 0.432 | 100.0% best | 0.551    | 127.5%           | 0.693 | 160.4%
typical         | varlen | 4     | 32/512 64/1024 128/2048 256/4096 | N/A   | -           | 2.798 | 199.8%      | 1.849    | 132.1%           | 1.400 | 100.0% best
single-long     | varlen | 1     | 256/4096                         | N/A   | -           | 2.183 | 198.1%      | 1.143    | 103.7%           | 1.102 | 100.0% best
ragged-longtail | varlen | 4     | 1/4096 1/8192 8/512 128/2048     | N/A   | -           | 0.990 | 100.0% best | 1.024    | 103.5%           | 1.100 | 111.2%
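
To make the table columns concrete: the `*_vs_best` values are each backend's latency as a percentage of the fastest backend on that case, with `N/A` for unsupported backends. A sketch of that computation (not the tool's exact code, inputs are illustrative):

```python
def vs_best(latencies):
    # `*_vs_best` columns: each backend's latency as a percentage of the
    # fastest supported backend for the case; None marks "N/A" backends.
    best = min(v for v in latencies.values() if v is not None)
    return {name: None if lat is None else round(lat / best * 100, 1)
            for name, lat in latencies.items()}

row = vs_best({"v1": None, "v2": 1.0, "textbook": 1.5, "sdpa": 2.0})
# → {'v1': None, 'v2': 100.0, 'textbook': 150.0, 'sdpa': 200.0}
```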


Multiple Layer Metal Attention Benchmark

cases: decode-small, decode-typical, decode-big-head, decode-long, varlen-light, varlen-typical, varlen-single-long, varlen-ragged-longtail
num_layers: 10  block_size: 16  dtype: float16  warmup: 10  iters: 100  seed: 0

case            | type   | batch | shape                            | v1    | v1_vs_best  | v2    | v2_vs_best  | textbook | textbook_vs_best | sdpa  | sdpa_vs_best
----------------+--------+-------+----------------------------------+-------+-------------+-------+-------------+----------+------------------+-------+-------------
small           | decode | 1     | B=1, q=1, kv=128                 | 0.244 | 529.2%      | 0.222 | 481.4%      | 0.046    | 100.0% best      | 0.183 | 396.7%
typical         | decode | 8     | B=8, q=1, kv=2048                | 0.311 | 100.4%      | 0.310 | 100.0% best | 0.371    | 119.7%           | 0.807 | 260.7%
big-head        | decode | 8     | B=8, q=1, kv=2048                | 0.552 | 100.0% best | 0.572 | 103.7%      | 0.652    | 118.2%           | 0.901 | 163.2%
long            | decode | 32    | B=32, q=1, kv=8192               | 1.169 | 106.2%      | 1.101 | 100.0% best | 1.634    | 148.4%           | 4.365 | 396.6%
light           | varlen | 4     | 1/128 4/256 16/512 64/1024       | N/A   | -           | 0.459 | 270.3%      | 0.170    | 100.0% best      | 0.411 | 241.7%
typical         | varlen | 4     | 32/512 64/1024 128/2048 256/4096 | N/A   | -           | 2.934 | 891.9%      | 0.329    | 100.0% best      | 0.601 | 182.8%
single-long     | varlen | 1     | 256/4096                         | N/A   | -           | 2.449 | 1402.8%     | 0.175    | 100.0% best      | 0.365 | 209.1%
ragged-longtail | varlen | 4     | 1/4096 1/8192 8/512 128/2048     | N/A   | -           | 1.031 | 406.7%      | 0.253    | 100.0% best      | 0.628 | 247.9%

Single-layer summary:

  • On decode, v2 is already the best backend on the main representative cases (decode-typical and decode-long).
  • v1 still wins on decode-big-head, so large-head decode remains a clear gap for v2.
  • On varlen, the picture is mixed: v2 wins varlen-light and varlen-ragged-longtail, while sdpa wins the heavier cases (varlen-typical and varlen-single-long).

Compared with the multi-layer run (num_layers=10):

  • The decode ranking is largely unchanged: v2 still leads on decode-typical and decode-long, and v1 still leads on decode-big-head.
  • The biggest change is on the varlen side. In the single-layer run, wins were split between v2 and sdpa; in the 10-layer run, all current varlen cases shift to textbook.

Fixes #175

@Kingwl Kingwl force-pushed the feat/add-benchmark-script branch 2 times, most recently from 5833d59 to 9df31ac Compare March 18, 2026 13:30

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5833d593e7


tokens_per_s=None,
notes="unsupported in varlen mode",
)
fn = lambda: run_v1_paged_attention(


P2: Time v1 backend without helper-side barriers

time_backend already synchronizes before and after every timed iteration, but the v1 path calls run_v1_paged_attention, which itself performs mx.eval(...) and mx.synchronize() each call. That adds extra fixed overhead only to v1, so the reported v1 latency/tokens/s is inflated relative to v2/textbook and the cross-backend comparison is not apples-to-apples.

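
One possible shape of the fix, sketched here with stand-ins (`time.perf_counter` and a no-op `sync` instead of MLX's real `mx.eval`/`mx.synchronize`): all barriers live in the timing harness, so a backend helper like `run_v1_paged_attention` would not add its own synchronization overhead to the timed region.

```python
import time

def time_backend(fn, sync, warmup=10, iters=100):
    """Sketch: the harness owns all barriers; backend `fn`s must NOT
    synchronize internally, so every backend pays the same fixed cost."""
    for _ in range(warmup):
        fn()
        sync()
    sync()  # drain any pending work before the timed region
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()  # a single barrier after the loop
    return (time.perf_counter() - start) / iters * 1e3  # ms per iteration

# stand-ins: a no-op sync and a tiny CPU workload
latency_ms = time_backend(lambda: sum(range(1000)), sync=lambda: None,
                          warmup=2, iters=5)
```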

@Kingwl Kingwl force-pushed the feat/add-benchmark-script branch from 9df31ac to e739db0 Compare March 18, 2026 13:41
@Kingwl Kingwl force-pushed the feat/add-benchmark-script branch from e739db0 to 8d9bfb7 Compare March 18, 2026 15:22
Collaborator

@LxYuan0420 LxYuan0420 left a comment


Left some comments; please use `-s` when committing files.

@Kingwl Kingwl force-pushed the feat/add-benchmark-script branch 2 times, most recently from 67ca1d3 to 8fb00e2 Compare March 18, 2026 15:35
@WindChimeRan
Collaborator

WindChimeRan commented Mar 18, 2026

@Kingwl Thanks for the contributions!

  • Please check the upstream vllm/benchmarks/attention_benchmarks/ and briefly discuss the design trade-offs in the PR description. For example, this PR only benchmarks single-layer attention; is that good enough, or should we extend it to multi-layer in the future (this might be overkill, I'm not sure)?
  • Output format: a plain-text table (especially at this size) is difficult for both humans and AI agents to read and analyze. You may refer to the upstream benchmark to polish the output a bit.
  • Small PR description formatting issue: can you render the result table as an actual markdown table? The current version is not very human-readable.
  • It would be better if you could include your own interpretation of the benchmark results in the PR description. This will give us future direction on kernel optimization (e.g., v2 is not good in prefill-single-long).

Note: please don't spend too much energy on the v1 kernel. Once we are sure v2 >> v1, we should deprecate v1 immediately because it has design flaws.

@Kingwl
Contributor Author

Kingwl commented Mar 19, 2026

  • Added support for multi-layer benchmarking.
  • Added a configs YAML to specify test-case shapes.
  • Supported CSV + JSON output files.
  • Adjusted the benchmark result table.
  • Added some simple interpretations of the benchmark results.
  • Updated the PR description.

@Kingwl Kingwl force-pushed the feat/add-benchmark-script branch 2 times, most recently from 63426ab to 0e67258 Compare March 19, 2026 11:41
@WindChimeRan
Collaborator

LGTM.

Do you want to take another look? I didn't check it line by line. @LxYuan0420

Collaborator

@LxYuan0420 LxYuan0420 left a comment


One minor feedback to change before we merge.

Good work overall!

Collaborator


Would prefer to keep this benchmark self-contained in this file, but avoid introducing a small DSL here.

As a general principle, for developer-only tooling we should prefer plain Python over a config DSL unless there is a clear payoff, for example non-engineers need to edit it or the same config must be shared across multiple tools/languages. Here, configs/*.yaml plus the merge logic effectively become a mini language for benchmark presets, and I do not think that extra layer is buying us much.

I think this would be simpler if we keep the preset catalog directly in attention_benchmark.py as built-in constants, and support:

  • built-in groups such as all, decode, varlen, small, typical, long
  • explicit case selection
  • fully manual one-off workloads via the existing flags

One important detail: please keep q_lens / kv_lens typed in code as tuples/lists, rather than comma-separated strings, so we do not just move the DSL from YAML into Python strings.

Something like this would be much easier to maintain:

- # YAML config files + loader + merge logic
+ DEFAULTS = dict(
+     backend="v1,v2,textbook,sdpa",
+     num_q_heads=8,
+     num_kv_heads=8,
+     head_dim=128,
+     block_size=16,
+     num_blocks=256,
+     dtype="float16",
+     warmup=10,
+     iters=100,
+     seed=0,
+     num_layers=1,
+ )
+
+ CASES = {
+     "decode-small": dict(mode="decode", batch_size=1, kv_lens=(128,)),
+     "decode-typical": dict(mode="decode", batch_size=8, kv_lens=(2048,)),
+     "decode-big-head": dict(
+         mode="decode",
+         batch_size=8,
+         kv_lens=(2048,),
+         num_q_heads=32,
+         num_kv_heads=8,
+         head_dim=256,
+     ),
+     "decode-long": dict(
+         mode="decode",
+         batch_size=32,
+         kv_lens=(8192,),
+         num_blocks=512,
+     ),
+     "varlen-light": dict(
+         mode="varlen",
+         q_lens=(1, 4, 16, 64),
+         kv_lens=(128, 256, 512, 1024),
+     ),
+     "varlen-typical": dict(
+         mode="varlen",
+         q_lens=(32, 64, 128, 256),
+         kv_lens=(512, 1024, 2048, 4096),
+     ),
+     "varlen-single-long": dict(
+         mode="varlen",
+         q_lens=(256,),
+         kv_lens=(4096,),
+     ),
+     "varlen-ragged-longtail": dict(
+         mode="varlen",
+         q_lens=(1, 1, 8, 128),
+         kv_lens=(4096, 8192, 512, 2048),
+         num_blocks=512,
+     ),
+ }
+
+ GROUPS = {
+     "all": tuple(CASES),
+     "decode": tuple(name for name in CASES if name.startswith("decode-")),
+     "varlen": tuple(name for name in CASES if name.startswith("varlen-")),
+     "small": ("decode-small", "varlen-light"),
+     "typical": ("decode-typical", "varlen-typical"),
+     "long": (
+         "decode-big-head",
+         "decode-long",
+         "varlen-single-long",
+         "varlen-ragged-longtail",
+     ),
+ }

That would give a simpler user-facing interface like:

python -m tools.benchmark.attention_benchmark
python -m tools.benchmark.attention_benchmark --group decode
python -m tools.benchmark.attention_benchmark --group small
python -m tools.benchmark.attention_benchmark --cases decode-small,varlen-light
python -m tools.benchmark.attention_benchmark --mode decode --batch-size 8 --kv-lens 2048
python -m tools.benchmark.attention_benchmark --group decode --num-layers 10 --iters 200

This keeps the tool self-contained, removes YAML / PyYAML usage from this benchmark path, and makes the preset catalog easier to read, validate, and refactor.
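
To illustrate the suggested interface, a hypothetical sketch of resolving `--group` / `--cases` against the built-in catalog (CASES, GROUPS, and DEFAULTS are abbreviated here, not the full catalog above):

```python
# Abbreviated preset catalog for illustration only.
CASES = {
    "decode-small": dict(mode="decode", batch_size=1, kv_lens=(128,)),
    "varlen-light": dict(mode="varlen", q_lens=(1, 4), kv_lens=(128, 256)),
}
GROUPS = {"all": tuple(CASES), "small": ("decode-small", "varlen-light")}
DEFAULTS = dict(block_size=16, dtype="float16", warmup=10, iters=100)

def resolve(group=None, cases=None):
    """Expand a group name or an explicit comma-separated case list into
    merged configs, layering per-case overrides over DEFAULTS."""
    names = GROUPS[group] if group else tuple(cases.split(","))
    unknown = [n for n in names if n not in CASES]
    if unknown:
        raise ValueError(f"unknown case(s): {unknown}")
    return {name: {**DEFAULTS, **CASES[name]} for name in names}

configs = resolve(group="small")
```

Keeping the merge as a plain `{**DEFAULTS, **case}` dict update is what makes this cheaper to maintain than a YAML loader plus merge logic.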

Collaborator


Please also revise tools/README.md and make sure it is easy to understand and to run specific commands from.

Contributor Author


Done.

Signed-off-by: kingwl <kingwenlu@gmail.com>
@Kingwl Kingwl force-pushed the feat/add-benchmark-script branch from 0e67258 to c71c599 Compare March 20, 2026 06:44
Collaborator

@LxYuan0420 LxYuan0420 left a comment


@Kingwl Looks good. One thing before I approve:

Could you please run these and paste the output in the PR, so we can confirm the tool works end-to-end as documented?

python -m tools.benchmark.attention_benchmark --group small --iters 5 --warmup 2

python -m tools.benchmark.attention_benchmark --mode decode --batch-size 4 --kv-lens 512 --backend v1,v2 --iters 5 --warmup 2

python -m tools.benchmark.attention_benchmark --group small --iters 5 --warmup 2 --output-json /tmp/attention.json

cat /tmp/attention.json

@Kingwl
Contributor Author

Kingwl commented Mar 20, 2026

Sure

python -m tools.benchmark.attention_benchmark --group small --iters 5 --warmup 2

Metal Attention Benchmark
cases: decode-small, varlen-light
num_layers: 1  block_size: 16  dtype: float16  warmup: 2  iters: 5  seed: 0

case  | type   | batch | shape                      | v1    | v1_vs_best | v2    | v2_vs_best  | textbook | textbook_vs_best | sdpa  | sdpa_vs_best
------+--------+-------+----------------------------+-------+------------+-------+-------------+----------+------------------+-------+-------------
small | decode | 1     | B=1, q=1, kv=128           | 0.408 | 120.3%     | 0.363 | 107.1%      | 0.339    | 100.0% best      | 0.409 | 120.4%      
light | varlen | 4     | 1/128 4/256 16/512 64/1024 | N/A   | -          | 1.153 | 100.0% best | 1.185    | 102.8%           | 1.489 | 129.1%   

python -m tools.benchmark.attention_benchmark --mode decode --batch-size 4 --kv-lens 512 --backend v1,v2 --iters 5 --warmup 2

Metal Attention Benchmark
case: custom
mode: decode
workload: batch=4, q_len=1, kv_len=[512, 512, 512, 512]
heads(q/kv): 8/8  head_dim: 128  block_size: 16  num_blocks: 256  num_layers: 1
dtype: float16  warmup: 2  iters: 5  seed: 0

case   | type   | batch | shape            | v1    | v1_vs_best  | v2    | v2_vs_best
-------+--------+-------+------------------+-------+-------------+-------+-----------
custom | decode | 4     | B=4, q=1, kv=512 | 0.428 | 100.0% best | 0.455 | 106.4%  

python -m tools.benchmark.attention_benchmark --group small --iters 5 --warmup 2 --output-json /tmp/attention.json

cat /tmp/attention.json

Metal Attention Benchmark
cases: decode-small, varlen-light
num_layers: 1  block_size: 16  dtype: float16  warmup: 2  iters: 5  seed: 0

case  | type   | batch | shape                      | v1    | v1_vs_best | v2    | v2_vs_best  | textbook | textbook_vs_best | sdpa  | sdpa_vs_best
------+--------+-------+----------------------------+-------+------------+-------+-------------+----------+------------------+-------+-------------
small | decode | 1     | B=1, q=1, kv=128           | 0.371 | 117.7%     | 0.317 | 100.6%      | 0.315    | 100.0% best      | 0.465 | 147.8%      
light | varlen | 4     | 1/128 4/256 16/512 64/1024 | N/A   | -          | 1.176 | 100.0% best | 1.186    | 100.8%           | 1.462 | 124.3%      
{
  "summary": {
    "cases": [
      "decode-small",
      "varlen-light"
    ],
    "num_layers": 1,
    "block_size": 16,
    "dtype": "float16",
    "warmup": 2,
    "iters": 5,
    "seed": 0
  },
  "rows": [
    {
      "case": "small",
      "case_name": "decode-small",
      "type": "decode",
      "batch": 1,
      "shape": "B=1, q=1, kv=128",
      "v1": 0.371,
      "v1_vs_best": 117.7,
      "v2": 0.317,
      "v2_vs_best": 100.6,
      "textbook": 0.315,
      "textbook_vs_best": 100.0,
      "sdpa": 0.465,
      "sdpa_vs_best": 147.8
    },
    {
      "case": "light",
      "case_name": "varlen-light",
      "type": "varlen",
      "batch": 4,
      "shape": "1/128 4/256 16/512 64/1024",
      "v1": null,
      "v1_vs_best": null,
      "v2": 1.176,
      "v2_vs_best": 100.0,
      "textbook": 1.186,
      "textbook_vs_best": 100.8,
      "sdpa": 1.462,
      "sdpa_vs_best": 124.3
    }
  ]
}

Collaborator

@LxYuan0420 LxYuan0420 left a comment


LGTM

@LxYuan0420
Collaborator

@Kingwl Could you please run the formatter / lint / type-check locally and push a fix?

@WindChimeRan
Collaborator

WindChimeRan commented Mar 20, 2026

@Kingwl Could you please offer a hypothesis for why v1 > v2 on decode-big-head (and whether v1 is significantly better there, or the gap is just an acceptable regression)?

This question is not blocking. Please feel free to merge.

I'm afraid we will have to deprecate v1 soon because it's not compatible with mixed prefilling & decoding. I just don't want to bury this key finding.

Signed-off-by: kingwl <kingwenlu@gmail.com>
@Kingwl
Contributor Author

Kingwl commented Mar 20, 2026

Fixed lint/format/typecheck issues with a new commit.

@Kingwl
Copy link
Copy Markdown
Contributor Author

Kingwl commented Mar 20, 2026

I ran a benchmark matrix:

ASSUMPTIONS: decode, batch_size=8, kv_lens=2048, backend=v1,v2, warmup=10, iters=500, dtype=float16, num_layers=1, repeats=5

| num_q_heads | num_kv_heads | q_per_kv | head_dim | v1 mean ms | v1 std | v2 mean ms | v2 std | v2 / v1 |
|---:|---:|---:|---:|---:|---:|---:|---:|:---|
| 8 | 8 | 1 | 128 | 0.377 | 0.101 | 0.332 | 0.027 | 0.88x |
| 8 | 8 | 1 | 256 | 0.397 | 0.029 | 0.399 | 0.016 | 1.01x slower |
| 8 | 4 | 2 | 128 | 0.323 | 0.013 | 0.296 | 0.004 | 0.92x |
| 8 | 4 | 2 | 256 | 0.370 | 0.093 | 0.344 | 0.024 | 0.93x |
| 8 | 2 | 4 | 128 | 0.312 | 0.012 | 0.294 | 0.003 | 0.94x |
| 8 | 2 | 4 | 256 | 0.343 | 0.004 | 0.322 | 0.005 | 0.94x |
| 16 | 8 | 2 | 128 | 0.339 | 0.008 | 0.326 | 0.007 | 0.96x |
| 16 | 8 | 2 | 256 | 0.448 | 0.020 | 0.448 | 0.014 | 1.00x |
| 16 | 4 | 4 | 128 | 0.345 | 0.013 | 0.339 | 0.025 | 0.98x |
| 16 | 4 | 4 | 256 | 0.448 | 0.061 | 0.434 | 0.021 | 0.97x |
| 16 | 2 | 8 | 128 | 0.326 | 0.008 | 0.331 | 0.023 | 1.02x slower |
| 16 | 2 | 8 | 256 | 0.435 | 0.014 | 0.445 | 0.015 | 1.02x slower |
| 32 | 8 | 4 | 128 | 0.443 | 0.011 | 0.428 | 0.008 | 0.97x |
| 32 | 8 | 4 | 256 | 0.564 | 0.005 | 0.574 | 0.002 | 1.02x slower |
| 32 | 4 | 8 | 128 | 0.456 | 0.012 | 0.443 | 0.003 | 0.97x |
| 32 | 4 | 8 | 256 | 0.550 | 0.014 | 0.591 | 0.053 | 1.07x slower |
| 32 | 2 | 16 | 128 | 0.443 | 0.008 | 0.434 | 0.010 | 0.98x |
| 32 | 2 | 16 | 256 | 0.555 | 0.008 | 0.566 | 0.002 | 1.02x slower |

Points above the 1.0 line mean v2 is slower:

[plot: v2/v1 latency ratio across shapes]

Current repeated decode benchmarks do not show a broad v2 regression versus v1. The main sensitivity appears to come from larger head_dim and, secondarily, larger num_q_heads / higher q_per_kv, which can push some shapes from slightly faster to roughly parity or slightly slower.
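
The v2/v1 column above can be reproduced mechanically; a small sketch (three sample rows copied from the matrix, helper name hypothetical) that flags regressing shapes:

```python
rows = [
    # (num_q_heads, num_kv_heads, head_dim, v1_mean_ms, v2_mean_ms)
    (8, 8, 128, 0.377, 0.332),
    (8, 8, 256, 0.397, 0.399),
    (32, 4, 256, 0.550, 0.591),
]

def classify(rows, tol=0.0):
    """Label each shape by its v2/v1 latency ratio; ratios above 1.0 + tol
    mean v2 is slower than v1 on that shape."""
    out = []
    for q, kv, hd, v1, v2 in rows:
        ratio = v2 / v1
        out.append((q, kv, hd, round(ratio, 2),
                    "slower" if ratio > 1.0 + tol else "ok"))
    return out

for row in classify(rows):
    print(row)
```

Raising `tol` (e.g. to 0.05) would treat near-parity shapes as acceptable rather than regressions.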

@LxYuan0420
Collaborator

@WindChimeRan WDYT?

@WindChimeRan
Collaborator


Thanks for the in-depth analysis. The regression seems to be non-significant and acceptable.

So I think it's a good time to deprecate the whole v1 kernel now (because v1 has no varlen support and is inherently incompatible with vLLM v1's unified prefilling and decoding).

Side note from Claude Code (Opus 4.6) on a hypothesis (I haven't taken a closer look):

The most actionable optimization would be to profile register spills using Metal's GPU profiler (Xcode Instruments → Metal System Trace) at head_dim=256 vs 128. If spills appear at 256, the fix might be as simple as reducing NUM_WARPS from 8 to 4 for the large-head codepath.

@WindChimeRan WindChimeRan merged commit eb83460 into vllm-project:main Mar 20, 2026
5 checks passed
@WindChimeRan
Collaborator

BTW, because of the merge of #172, kernel v1 and the kernel v1 benchmark are now dead code. We need to clean them up in follow-up PRs.


Development

Successfully merging this pull request may close these issues.

Benchmark kernel_v1, kernel_v2, textbook attention, and mlx sdpa

3 participants