
Add Metal attention benchmark tool#178

Merged
WindChimeRan merged 2 commits into vllm-project:main from Kingwl:feat/add-benchmark-script
Mar 20, 2026

Conversation

@Kingwl
Contributor

@Kingwl Kingwl commented Mar 18, 2026

Summary

This PR adds a local Metal attention benchmark under tools/benchmark/attention_benchmark.py.

It supports both preset workloads and fully manual one-off workloads. Presets are defined directly in Python as built-in CASES and GROUPS, with built-in groups such as all, decode, varlen, small, typical, and long.

The benchmark supports num_layers for multi-layer benchmarking and reports per-layer average latency, following the upstream design.

By default, preset runs compare v1, v2, textbook, and sdpa. Passing --backend all also includes sdpa-compute-only as an additional baseline.

The benchmark prints a text table to stdout and also supports --output-json and --output-csv exports.

It also factors shared benchmark/reference helpers into tools/attention_bench_utils.py and adds usage notes to tools/README.md.
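
For context, a minimal sketch of how a per-layer average latency like the one reported below can be computed (the helper name and the trivial stand-in workload here are hypothetical, not the tool's actual code):

```python
import time

def time_per_layer(run_layer, num_layers=1, warmup=10, iters=100):
    """Hypothetical sketch: time `iters` multi-layer passes and report the
    average latency per layer, matching how the summary is described."""
    for _ in range(warmup):
        for _ in range(num_layers):
            run_layer()
    start = time.perf_counter()
    for _ in range(iters):
        for _ in range(num_layers):
            run_layer()
    elapsed = time.perf_counter() - start
    return elapsed / iters / num_layers * 1e3  # ms per layer

# usage with a trivial CPU stand-in for the attention kernel
latency_ms = time_per_layer(lambda: sum(range(1000)), num_layers=4, warmup=2, iters=5)
```

The real tool additionally has to insert GPU barriers around the timed region; see the review discussion below.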

Results may vary slightly run to run.

Single Layer Metal Attention Benchmark

cases: decode-small, decode-typical, decode-big-head, decode-long, varlen-light, varlen-typical, varlen-single-long, varlen-ragged-longtail
num_layers: 1  block_size: 16  dtype: float16  warmup: 10  iters: 100  seed: 0

case            | type   | batch | shape                            | v1    | v1_vs_best  | v2    | v2_vs_best  | textbook | textbook_vs_best | sdpa  | sdpa_vs_best
----------------+--------+-------+----------------------------------+-------+-------------+-------+-------------+----------+------------------+-------+-------------
small           | decode | 1     | B=1, q=1, kv=128                 | 0.319 | 111.8%      | 0.286 | 100.1%      | 0.286    | 100.0% best      | 0.337 | 118.0%
typical         | decode | 8     | B=8, q=1, kv=2048                | 0.328 | 108.0%      | 0.304 | 100.0% best | 0.957    | 314.6%           | 0.881 | 289.6%
big-head        | decode | 8     | B=8, q=1, kv=2048                | 0.477 | 100.0% best | 0.514 | 107.7%      | 3.900    | 817.7%           | 1.244 | 260.7%
long            | decode | 32    | B=32, q=1, kv=8192               | 1.121 | 107.8%      | 1.041 | 100.0% best | 6.615    | 635.6%           | 5.112 | 491.2%
light           | varlen | 4     | 1/128 4/256 16/512 64/1024       | N/A   | -           | 0.432 | 100.0% best | 0.551    | 127.5%           | 0.693 | 160.4%
typical         | varlen | 4     | 32/512 64/1024 128/2048 256/4096 | N/A   | -           | 2.798 | 199.8%      | 1.849    | 132.1%           | 1.400 | 100.0% best
single-long     | varlen | 1     | 256/4096                         | N/A   | -           | 2.183 | 198.1%      | 1.143    | 103.7%           | 1.102 | 100.0% best
ragged-longtail | varlen | 4     | 1/4096 1/8192 8/512 128/2048     | N/A   | -           | 0.990 | 100.0% best | 1.024    | 103.5%           | 1.100 | 111.2%
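
To make the table columns concrete: the `*_vs_best` values are each backend's latency as a percentage of the fastest backend on that case, with `N/A` for unsupported backends. A sketch of that computation (not the tool's exact code, inputs are illustrative):

```python
def vs_best(latencies):
    # `*_vs_best` columns: each backend's latency as a percentage of the
    # fastest supported backend for the case; None marks "N/A" backends.
    best = min(v for v in latencies.values() if v is not None)
    return {name: None if lat is None else round(lat / best * 100, 1)
            for name, lat in latencies.items()}

row = vs_best({"v1": None, "v2": 1.0, "textbook": 1.5, "sdpa": 2.0})
# → {'v1': None, 'v2': 100.0, 'textbook': 150.0, 'sdpa': 200.0}
```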


Multiple Layer Metal Attention Benchmark

cases: decode-small, decode-typical, decode-big-head, decode-long, varlen-light, varlen-typical, varlen-single-long, varlen-ragged-longtail
num_layers: 10  block_size: 16  dtype: float16  warmup: 10  iters: 100  seed: 0

case            | type   | batch | shape                            | v1    | v1_vs_best  | v2    | v2_vs_best  | textbook | textbook_vs_best | sdpa  | sdpa_vs_best
----------------+--------+-------+----------------------------------+-------+-------------+-------+-------------+----------+------------------+-------+-------------
small           | decode | 1     | B=1, q=1, kv=128                 | 0.244 | 529.2%      | 0.222 | 481.4%      | 0.046    | 100.0% best      | 0.183 | 396.7%
typical         | decode | 8     | B=8, q=1, kv=2048                | 0.311 | 100.4%      | 0.310 | 100.0% best | 0.371    | 119.7%           | 0.807 | 260.7%
big-head        | decode | 8     | B=8, q=1, kv=2048                | 0.552 | 100.0% best | 0.572 | 103.7%      | 0.652    | 118.2%           | 0.901 | 163.2%
long            | decode | 32    | B=32, q=1, kv=8192               | 1.169 | 106.2%      | 1.101 | 100.0% best | 1.634    | 148.4%           | 4.365 | 396.6%
light           | varlen | 4     | 1/128 4/256 16/512 64/1024       | N/A   | -           | 0.459 | 270.3%      | 0.170    | 100.0% best      | 0.411 | 241.7%
typical         | varlen | 4     | 32/512 64/1024 128/2048 256/4096 | N/A   | -           | 2.934 | 891.9%      | 0.329    | 100.0% best      | 0.601 | 182.8%
single-long     | varlen | 1     | 256/4096                         | N/A   | -           | 2.449 | 1402.8%     | 0.175    | 100.0% best      | 0.365 | 209.1%
ragged-longtail | varlen | 4     | 1/4096 1/8192 8/512 128/2048     | N/A   | -           | 1.031 | 406.7%      | 0.253    | 100.0% best      | 0.628 | 247.9%

Single-layer summary:

  • On decode, v2 is already the best backend on the main representative cases (decode-typical and decode-long).
  • v1 still wins on decode-big-head, so large-head decode remains a clear gap for v2.
  • On varlen, the picture is mixed: v2 wins varlen-light and varlen-ragged-longtail, while sdpa wins the heavier cases (varlen-typical and varlen-single-long).

Compared with the multi-layer run (num_layers=10):

  • The decode ranking is largely unchanged: v2 still leads on decode-typical and decode-long, and v1 still leads on decode-big-head.
  • The biggest change is on the varlen side. In the single-layer run, wins were split between v2 and sdpa; in the 10-layer run, all current varlen cases shift to textbook.

Fixes #175

@Kingwl Kingwl force-pushed the feat/add-benchmark-script branch 2 times, most recently from 5833d59 to 9df31ac Compare March 18, 2026 13:30

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5833d593e7


tokens_per_s=None,
notes="unsupported in varlen mode",
)
fn = lambda: run_v1_paged_attention(


P2: Time v1 backend without helper-side barriers

time_backend already synchronizes before and after every timed iteration, but the v1 path calls run_v1_paged_attention, which itself performs mx.eval(...) and mx.synchronize() each call. That adds extra fixed overhead only to v1, so the reported v1 latency/tokens/s is inflated relative to v2/textbook and the cross-backend comparison is not apples-to-apples.

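
One possible shape of the fix, sketched here with stand-ins (`time.perf_counter` and a no-op `sync` instead of MLX's real `mx.eval`/`mx.synchronize`): all barriers live in the timing harness, so a backend helper like `run_v1_paged_attention` would not add its own synchronization overhead to the timed region.

```python
import time

def time_backend(fn, sync, warmup=10, iters=100):
    """Sketch: the harness owns all barriers; backend `fn`s must NOT
    synchronize internally, so every backend pays the same fixed cost."""
    for _ in range(warmup):
        fn()
        sync()
    sync()  # drain any pending work before the timed region
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()  # a single barrier after the loop
    return (time.perf_counter() - start) / iters * 1e3  # ms per iteration

# stand-ins: a no-op sync and a tiny CPU workload
latency_ms = time_backend(lambda: sum(range(1000)), sync=lambda: None,
                          warmup=2, iters=5)
```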

@Kingwl Kingwl force-pushed the feat/add-benchmark-script branch from 9df31ac to e739db0 Compare March 18, 2026 13:41
@Kingwl Kingwl force-pushed the feat/add-benchmark-script branch from e739db0 to 8d9bfb7 Compare March 18, 2026 15:22
Collaborator

@LxYuan0420 LxYuan0420 left a comment


Left some comments; please use `-s` when committing files.

@Kingwl Kingwl force-pushed the feat/add-benchmark-script branch 2 times, most recently from 67ca1d3 to 8fb00e2 Compare March 18, 2026 15:35
@WindChimeRan
Collaborator

WindChimeRan commented Mar 18, 2026

@Kingwl Thanks for the contributions!

  • Please check the upstream vllm/benchmarks/attention_benchmarks/ and briefly discuss the design trade-offs in the PR description. For example, this PR only benchmarks single-layer attention; is that good enough, or should we extend it to multi-layer in the future (this might be overkill, I'm not sure)?
  • Output format: a plain-text table (especially at this size) is difficult for both humans and AI agents to read and analyze. You may refer to the upstream benchmark to polish the output a bit.
  • Small PR description formatting issue: can you render the result table as an actual markdown table? The current version is not very human-readable.
  • It would be better if you could include your own interpretation of the benchmark results in the PR description. This will give us future direction on kernel optimization (e.g., v2 is not good in prefill-single-long).

Note: please don't spend too much energy on the v1 kernel. Once we are sure v2 >> v1, we should deprecate v1 immediately because it has design flaws.

@Kingwl
Contributor Author

Kingwl commented Mar 19, 2026

  • Added support for multi-layer benchmarking.
  • Added a configs YAML to specify test-case shapes.
  • Supported CSV + JSON output files.
  • Adjusted the benchmark result table.
  • Added some simple interpretations of the benchmark results.
  • Updated the PR description.

@Kingwl Kingwl force-pushed the feat/add-benchmark-script branch 2 times, most recently from 63426ab to 0e67258 Compare March 19, 2026 11:41
@WindChimeRan
Collaborator

LGTM.

Do you want to take another look? I didn't check it line by line. @LxYuan0420

Collaborator

@LxYuan0420 LxYuan0420 left a comment


One minor feedback to change before we merge.

Good work overall!

Collaborator


Would prefer to keep this benchmark self-contained in this file, but avoid introducing a small DSL here.

As a general principle, for developer-only tooling we should prefer plain Python over a config DSL unless there is a clear payoff, for example non-engineers need to edit it or the same config must be shared across multiple tools/languages. Here, configs/*.yaml plus the merge logic effectively become a mini language for benchmark presets, and I do not think that extra layer is buying us much.

I think this would be simpler if we keep the preset catalog directly in attention_benchmark.py as built-in constants, and support:

  • built-in groups such as all, decode, varlen, small, typical, long
  • explicit case selection
  • fully manual one-off workloads via the existing flags

One important detail: please keep q_lens / kv_lens typed in code as tuples/lists, rather than comma-separated strings, so we do not just move the DSL from YAML into Python strings.

Something like this would be much easier to maintain:

- # YAML config files + loader + merge logic
+ DEFAULTS = dict(
+     backend="v1,v2,textbook,sdpa",
+     num_q_heads=8,
+     num_kv_heads=8,
+     head_dim=128,
+     block_size=16,
+     num_blocks=256,
+     dtype="float16",
+     warmup=10,
+     iters=100,
+     seed=0,
+     num_layers=1,
+ )
+
+ CASES = {
+     "decode-small": dict(mode="decode", batch_size=1, kv_lens=(128,)),
+     "decode-typical": dict(mode="decode", batch_size=8, kv_lens=(2048,)),
+     "decode-big-head": dict(
+         mode="decode",
+         batch_size=8,
+         kv_lens=(2048,),
+         num_q_heads=32,
+         num_kv_heads=8,
+         head_dim=256,
+     ),
+     "decode-long": dict(
+         mode="decode",
+         batch_size=32,
+         kv_lens=(8192,),
+         num_blocks=512,
+     ),
+     "varlen-light": dict(
+         mode="varlen",
+         q_lens=(1, 4, 16, 64),
+         kv_lens=(128, 256, 512, 1024),
+     ),
+     "varlen-typical": dict(
+         mode="varlen",
+         q_lens=(32, 64, 128, 256),
+         kv_lens=(512, 1024, 2048, 4096),
+     ),
+     "varlen-single-long": dict(
+         mode="varlen",
+         q_lens=(256,),
+         kv_lens=(4096,),
+     ),
+     "varlen-ragged-longtail": dict(
+         mode="varlen",
+         q_lens=(1, 1, 8, 128),
+         kv_lens=(4096, 8192, 512, 2048),
+         num_blocks=512,
+     ),
+ }
+
+ GROUPS = {
+     "all": tuple(CASES),
+     "decode": tuple(name for name in CASES if name.startswith("decode-")),
+     "varlen": tuple(name for name in CASES if name.startswith("varlen-")),
+     "small": ("decode-small", "varlen-light"),
+     "typical": ("decode-typical", "varlen-typical"),
+     "long": (
+         "decode-big-head",
+         "decode-long",
+         "varlen-single-long",
+         "varlen-ragged-longtail",
+     ),
+ }

That would give a simpler user-facing interface like:

python -m tools.benchmark.attention_benchmark
python -m tools.benchmark.attention_benchmark --group decode
python -m tools.benchmark.attention_benchmark --group small
python -m tools.benchmark.attention_benchmark --cases decode-small,varlen-light
python -m tools.benchmark.attention_benchmark --mode decode --batch-size 8 --kv-lens 2048
python -m tools.benchmark.attention_benchmark --group decode --num-layers 10 --iters 200

This keeps the tool self-contained, removes YAML / PyYAML usage from this benchmark path, and makes the preset catalog easier to read, validate, and refactor.
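
To illustrate the suggested interface, a hypothetical sketch of resolving `--group` / `--cases` against the built-in catalog (CASES, GROUPS, and DEFAULTS are abbreviated here, not the full catalog above):

```python
# Abbreviated preset catalog for illustration only.
CASES = {
    "decode-small": dict(mode="decode", batch_size=1, kv_lens=(128,)),
    "varlen-light": dict(mode="varlen", q_lens=(1, 4), kv_lens=(128, 256)),
}
GROUPS = {"all": tuple(CASES), "small": ("decode-small", "varlen-light")}
DEFAULTS = dict(block_size=16, dtype="float16", warmup=10, iters=100)

def resolve(group=None, cases=None):
    """Expand a group name or an explicit comma-separated case list into
    merged configs, layering per-case overrides over DEFAULTS."""
    names = GROUPS[group] if group else tuple(cases.split(","))
    unknown = [n for n in names if n not in CASES]
    if unknown:
        raise ValueError(f"unknown case(s): {unknown}")
    return {name: {**DEFAULTS, **CASES[name]} for name in names}

configs = resolve(group="small")
```

Keeping the merge as a plain `{**DEFAULTS, **case}` dict update is what makes this cheaper to maintain than a YAML loader plus merge logic.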

Collaborator


Please also revise tools/README.md and make sure it is easy to understand and to run specific commands from.

Contributor Author


Done.

Signed-off-by: kingwl <kingwenlu@gmail.com>
@Kingwl Kingwl force-pushed the feat/add-benchmark-script branch from 0e67258 to c71c599 Compare March 20, 2026 06:44
Collaborator

@LxYuan0420 LxYuan0420 left a comment


@Kingwl Looks good. One thing before I approve:

Could you please run these and paste the output in the PR, so we can confirm the tool works end-to-end as documented?

python -m tools.benchmark.attention_benchmark --group small --iters 5 --warmup 2

python -m tools.benchmark.attention_benchmark --mode decode --batch-size 4 --kv-lens 512 --backend v1,v2 --iters 5 --warmup 2

python -m tools.benchmark.attention_benchmark --group small --iters 5 --warmup 2 --output-json /tmp/attention.json

cat /tmp/attention.json

@Kingwl
Contributor Author

Kingwl commented Mar 20, 2026

Sure

python -m tools.benchmark.attention_benchmark --group small --iters 5 --warmup 2

Metal Attention Benchmark
cases: decode-small, varlen-light
num_layers: 1  block_size: 16  dtype: float16  warmup: 2  iters: 5  seed: 0

case  | type   | batch | shape                      | v1    | v1_vs_best | v2    | v2_vs_best  | textbook | textbook_vs_best | sdpa  | sdpa_vs_best
------+--------+-------+----------------------------+-------+------------+-------+-------------+----------+------------------+-------+-------------
small | decode | 1     | B=1, q=1, kv=128           | 0.408 | 120.3%     | 0.363 | 107.1%      | 0.339    | 100.0% best      | 0.409 | 120.4%      
light | varlen | 4     | 1/128 4/256 16/512 64/1024 | N/A   | -          | 1.153 | 100.0% best | 1.185    | 102.8%           | 1.489 | 129.1%   

python -m tools.benchmark.attention_benchmark --mode decode --batch-size 4 --kv-lens 512 --backend v1,v2 --iters 5 --warmup 2

Metal Attention Benchmark
case: custom
mode: decode
workload: batch=4, q_len=1, kv_len=[512, 512, 512, 512]
heads(q/kv): 8/8  head_dim: 128  block_size: 16  num_blocks: 256  num_layers: 1
dtype: float16  warmup: 2  iters: 5  seed: 0

case   | type   | batch | shape            | v1    | v1_vs_best  | v2    | v2_vs_best
-------+--------+-------+------------------+-------+-------------+-------+-----------
custom | decode | 4     | B=4, q=1, kv=512 | 0.428 | 100.0% best | 0.455 | 106.4%  

python -m tools.benchmark.attention_benchmark --group small --iters 5 --warmup 2 --output-json /tmp/attention.json

cat /tmp/attention.json

Metal Attention Benchmark
cases: decode-small, varlen-light
num_layers: 1  block_size: 16  dtype: float16  warmup: 2  iters: 5  seed: 0

case  | type   | batch | shape                      | v1    | v1_vs_best | v2    | v2_vs_best  | textbook | textbook_vs_best | sdpa  | sdpa_vs_best
------+--------+-------+----------------------------+-------+------------+-------+-------------+----------+------------------+-------+-------------
small | decode | 1     | B=1, q=1, kv=128           | 0.371 | 117.7%     | 0.317 | 100.6%      | 0.315    | 100.0% best      | 0.465 | 147.8%      
light | varlen | 4     | 1/128 4/256 16/512 64/1024 | N/A   | -          | 1.176 | 100.0% best | 1.186    | 100.8%           | 1.462 | 124.3%      
{
  "summary": {
    "cases": [
      "decode-small",
      "varlen-light"
    ],
    "num_layers": 1,
    "block_size": 16,
    "dtype": "float16",
    "warmup": 2,
    "iters": 5,
    "seed": 0
  },
  "rows": [
    {
      "case": "small",
      "case_name": "decode-small",
      "type": "decode",
      "batch": 1,
      "shape": "B=1, q=1, kv=128",
      "v1": 0.371,
      "v1_vs_best": 117.7,
      "v2": 0.317,
      "v2_vs_best": 100.6,
      "textbook": 0.315,
      "textbook_vs_best": 100.0,
      "sdpa": 0.465,
      "sdpa_vs_best": 147.8
    },
    {
      "case": "light",
      "case_name": "varlen-light",
      "type": "varlen",
      "batch": 4,
      "shape": "1/128 4/256 16/512 64/1024",
      "v1": null,
      "v1_vs_best": null,
      "v2": 1.176,
      "v2_vs_best": 100.0,
      "textbook": 1.186,
      "textbook_vs_best": 100.8,
      "sdpa": 1.462,
      "sdpa_vs_best": 124.3
    }
  ]
}

Collaborator

@LxYuan0420 LxYuan0420 left a comment


LGTM

@LxYuan0420
Collaborator

@Kingwl Could you please run the formatter / lint / type-check locally and push a fix?

@WindChimeRan
Collaborator

WindChimeRan commented Mar 20, 2026

@Kingwl Could you please offer a hypothesis for why v1 > v2 on decode-big-head (and whether v1 is significantly better there, or the gap is just an acceptable regression)?

This question is not blocking. Please feel free to merge.

I'm afraid we will have to deprecate v1 soon because it's not compatible with mixed prefilling & decoding. I just don't want to bury this key finding.

Signed-off-by: kingwl <kingwenlu@gmail.com>
@Kingwl
Contributor Author

Kingwl commented Mar 20, 2026

Fixed lint/format/typecheck issues with a new commit.

@Kingwl
Copy link
Copy Markdown
Contributor Author

Kingwl commented Mar 20, 2026

I ran a benchmark matrix:

ASSUMPTIONS: decode, batch_size=8, kv_lens=2048, backend=v1,v2, warmup=10, iters=500, dtype=float16, num_layers=1, repeats=5

| num_q_heads | num_kv_heads | q_per_kv | head_dim | v1 mean ms | v1 std | v2 mean ms | v2 std | v2 / v1 |
|---:|---:|---:|---:|---:|---:|---:|---:|:---|
| 8 | 8 | 1 | 128 | 0.377 | 0.101 | 0.332 | 0.027 | 0.88x |
| 8 | 8 | 1 | 256 | 0.397 | 0.029 | 0.399 | 0.016 | 1.01x slower |
| 8 | 4 | 2 | 128 | 0.323 | 0.013 | 0.296 | 0.004 | 0.92x |
| 8 | 4 | 2 | 256 | 0.370 | 0.093 | 0.344 | 0.024 | 0.93x |
| 8 | 2 | 4 | 128 | 0.312 | 0.012 | 0.294 | 0.003 | 0.94x |
| 8 | 2 | 4 | 256 | 0.343 | 0.004 | 0.322 | 0.005 | 0.94x |
| 16 | 8 | 2 | 128 | 0.339 | 0.008 | 0.326 | 0.007 | 0.96x |
| 16 | 8 | 2 | 256 | 0.448 | 0.020 | 0.448 | 0.014 | 1.00x |
| 16 | 4 | 4 | 128 | 0.345 | 0.013 | 0.339 | 0.025 | 0.98x |
| 16 | 4 | 4 | 256 | 0.448 | 0.061 | 0.434 | 0.021 | 0.97x |
| 16 | 2 | 8 | 128 | 0.326 | 0.008 | 0.331 | 0.023 | 1.02x slower |
| 16 | 2 | 8 | 256 | 0.435 | 0.014 | 0.445 | 0.015 | 1.02x slower |
| 32 | 8 | 4 | 128 | 0.443 | 0.011 | 0.428 | 0.008 | 0.97x |
| 32 | 8 | 4 | 256 | 0.564 | 0.005 | 0.574 | 0.002 | 1.02x slower |
| 32 | 4 | 8 | 128 | 0.456 | 0.012 | 0.443 | 0.003 | 0.97x |
| 32 | 4 | 8 | 256 | 0.550 | 0.014 | 0.591 | 0.053 | 1.07x slower |
| 32 | 2 | 16 | 128 | 0.443 | 0.008 | 0.434 | 0.010 | 0.98x |
| 32 | 2 | 16 | 256 | 0.555 | 0.008 | 0.566 | 0.002 | 1.02x slower |

Points above the 1.0 line mean v2 is slower:

[plot: v2/v1 latency ratio across shapes]

Current repeated decode benchmarks do not show a broad v2 regression versus v1. The main sensitivity appears to come from larger head_dim and, secondarily, larger num_q_heads / higher q_per_kv, which can push some shapes from slightly faster to roughly parity or slightly slower.
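
The v2/v1 column above can be reproduced mechanically; a small sketch (three sample rows copied from the matrix, helper name hypothetical) that flags regressing shapes:

```python
rows = [
    # (num_q_heads, num_kv_heads, head_dim, v1_mean_ms, v2_mean_ms)
    (8, 8, 128, 0.377, 0.332),
    (8, 8, 256, 0.397, 0.399),
    (32, 4, 256, 0.550, 0.591),
]

def classify(rows, tol=0.0):
    """Label each shape by its v2/v1 latency ratio; ratios above 1.0 + tol
    mean v2 is slower than v1 on that shape."""
    out = []
    for q, kv, hd, v1, v2 in rows:
        ratio = v2 / v1
        out.append((q, kv, hd, round(ratio, 2),
                    "slower" if ratio > 1.0 + tol else "ok"))
    return out

for row in classify(rows):
    print(row)
```

Raising `tol` (e.g. to 0.05) would treat near-parity shapes as acceptable rather than regressions.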

@LxYuan0420
Collaborator

@WindChimeRan WDYT?

@WindChimeRan
Collaborator


Thanks for the in-depth analysis. The regression seems to be non-significant and acceptable.

So I think it's a good time to deprecate the whole v1 kernel now (because v1 has no varlen support and is inherently incompatible with vLLM v1's unified prefilling and decoding).

Side note from Claude Code (Opus 4.6) on a hypothesis (I haven't taken a closer look):

The most actionable optimization would be to profile register spills using Metal's GPU profiler (Xcode Instruments → Metal System Trace) at head_dim=256 vs 128. If spills appear at 256, the fix might be as simple as reducing NUM_WARPS from 8 to 4 for the large-head codepath.

@WindChimeRan WindChimeRan merged commit eb83460 into vllm-project:main Mar 20, 2026
5 checks passed
@WindChimeRan
Collaborator

BTW, because of the merge of #172, kernel v1 and the kernel v1 benchmark are now dead code. We need to clean them up in follow-up PRs.


Development

Successfully merging this pull request may close these issues.

Benchmark kernel_v1, kernel_v2, textbook attention, and mlx sdpa

3 participants