Add Metal attention benchmark tool #178
Conversation
Force-pushed 5833d59 to 9df31ac
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5833d593e7
tools/bench_metal_attention.py (outdated)

        tokens_per_s=None,
        notes="unsupported in varlen mode",
    )
fn = lambda: run_v1_paged_attention(
Time v1 backend without helper-side barriers
time_backend already synchronizes before and after every timed iteration, but the v1 path calls run_v1_paged_attention, which itself performs mx.eval(...) and mx.synchronize() on each call. That adds extra fixed overhead only to v1, so the reported v1 latency and tokens/s are inflated relative to v2/textbook, and the cross-backend comparison is not apples-to-apples.
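A minimal sketch of the fairness constraint described in this review comment: the timing harness should own all device barriers, and the per-backend fn must not eval/synchronize internally. The sync parameter and the exact signature here are illustrative assumptions, not the tool's actual API.

```python
import time


def time_backend(fn, sync=lambda: None, warmup=10, iters=100):
    """Time fn with harness-owned barriers only (illustrative sketch).

    sync is the device barrier (e.g. mx.synchronize on Metal). fn must
    not synchronize internally; if one backend (like v1 calling a
    helper that does its own eval/synchronize per call) adds barriers,
    it is charged fixed overhead the other backends do not pay.
    """
    # Warm up outside the timed region, then drain pending work once.
    for _ in range(warmup):
        fn()
    sync()
    latencies = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        sync()  # one barrier per timed iteration, identical for every backend
        latencies.append(time.perf_counter() - start)
    return sum(latencies) / iters
```

With this shape, making the comparison apples-to-apples only requires stripping the helper-side mx.eval/mx.synchronize from the v1 call path before handing it to the timer.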
Force-pushed 9df31ac to e739db0
Force-pushed e739db0 to 8d9bfb7
LxYuan0420
left a comment
Left some comments; please use -s (Signed-off-by) when committing.
Force-pushed 67ca1d3 to 8fb00e2
@Kingwl Thanks for the contributions!
Note: please don't spend too much energy on the v1 kernel. Once we are sure v2 >> v1, we should deprecate v1 immediately because it has design flaws.
Force-pushed 63426ab to 0e67258
LGTM. Do you want to take another look? I didn't check it line by line. @LxYuan0420
LxYuan0420
left a comment
One minor feedback to change before we merge.
Good work overall!
Would prefer to keep this benchmark self-contained in this file, but avoid introducing a small DSL here.
As a general principle, for developer-only tooling we should prefer plain Python over a config DSL unless there is a clear payoff, for example non-engineers need to edit it or the same config must be shared across multiple tools/languages. Here, configs/*.yaml plus the merge logic effectively become a mini language for benchmark presets, and I do not think that extra layer is buying us much.
I think this would be simpler if we keep the preset catalog directly in attention_benchmark.py as built-in constants, and support:
- built-in groups such as all, decode, varlen, small, typical, long
- explicit case selection
- fully manual one-off workloads via the existing flags
One important detail: please keep q_lens / kv_lens typed in code as tuples/lists, rather than comma-separated strings, so we do not just move the DSL from YAML into Python strings.
Something like this would be much easier to maintain:
- # YAML config files + loader + merge logic
+ DEFAULTS = dict(
+ backend="v1,v2,textbook,sdpa",
+ num_q_heads=8,
+ num_kv_heads=8,
+ head_dim=128,
+ block_size=16,
+ num_blocks=256,
+ dtype="float16",
+ warmup=10,
+ iters=100,
+ seed=0,
+ num_layers=1,
+ )
+
+ CASES = {
+ "decode-small": dict(mode="decode", batch_size=1, kv_lens=(128,)),
+ "decode-typical": dict(mode="decode", batch_size=8, kv_lens=(2048,)),
+ "decode-big-head": dict(
+ mode="decode",
+ batch_size=8,
+ kv_lens=(2048,),
+ num_q_heads=32,
+ num_kv_heads=8,
+ head_dim=256,
+ ),
+ "decode-long": dict(
+ mode="decode",
+ batch_size=32,
+ kv_lens=(8192,),
+ num_blocks=512,
+ ),
+ "varlen-light": dict(
+ mode="varlen",
+ q_lens=(1, 4, 16, 64),
+ kv_lens=(128, 256, 512, 1024),
+ ),
+ "varlen-typical": dict(
+ mode="varlen",
+ q_lens=(32, 64, 128, 256),
+ kv_lens=(512, 1024, 2048, 4096),
+ ),
+ "varlen-single-long": dict(
+ mode="varlen",
+ q_lens=(256,),
+ kv_lens=(4096,),
+ ),
+ "varlen-ragged-longtail": dict(
+ mode="varlen",
+ q_lens=(1, 1, 8, 128),
+ kv_lens=(4096, 8192, 512, 2048),
+ num_blocks=512,
+ ),
+ }
+
+ GROUPS = {
+ "all": tuple(CASES),
+ "decode": tuple(name for name in CASES if name.startswith("decode-")),
+ "varlen": tuple(name for name in CASES if name.startswith("varlen-")),
+ "small": ("decode-small", "varlen-light"),
+ "typical": ("decode-typical", "varlen-typical"),
+ "long": (
+ "decode-big-head",
+ "decode-long",
+ "varlen-single-long",
+ "varlen-ragged-longtail",
+ ),
+ }

That would give a simpler user-facing interface like:
python -m tools.benchmark.attention_benchmark
python -m tools.benchmark.attention_benchmark --group decode
python -m tools.benchmark.attention_benchmark --group small
python -m tools.benchmark.attention_benchmark --cases decode-small,varlen-light
python -m tools.benchmark.attention_benchmark --mode decode --batch-size 8 --kv-lens 2048
python -m tools.benchmark.attention_benchmark --group decode --num-layers 10 --iters 200

This keeps the tool self-contained, removes YAML / PyYAML usage from this benchmark path, and makes the preset catalog easier to read, validate, and refactor.
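The selection logic suggested above can be sketched as a small resolver over the built-in tables. The DEFAULTS/CASES/GROUPS entries here are trimmed stand-ins for the full catalogs shown in the diff, and resolve_cases is a hypothetical helper name.

```python
# Trimmed stand-ins for the full preset catalogs proposed above.
DEFAULTS = dict(backend="v1,v2,textbook,sdpa", head_dim=128, iters=100)

CASES = {
    "decode-small": dict(mode="decode", batch_size=1, kv_lens=(128,)),
    "varlen-light": dict(mode="varlen", q_lens=(1, 4), kv_lens=(128, 256)),
}

GROUPS = {
    "all": tuple(CASES),
    "small": ("decode-small", "varlen-light"),
}


def resolve_cases(group=None, cases=None):
    """Turn --group / --cases selections into fully merged workload dicts.

    Note that kv_lens / q_lens stay typed as tuples end to end; no
    comma-separated-string mini-DSL is involved.
    """
    names = tuple(cases) if cases is not None else GROUPS[group or "all"]
    unknown = [n for n in names if n not in CASES]
    if unknown:
        raise ValueError(f"unknown case(s): {unknown}")
    # Per-case settings override the shared defaults.
    return [{**DEFAULTS, **CASES[name], "name": name} for name in names]
```

One-off workloads built from the manual flags can bypass this entirely and construct a single dict over the same DEFAULTS.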
Please also revise tools/README.md and make sure it is easy to understand and to run the specific commands.
Signed-off-by: kingwl <kingwenlu@gmail.com>
Force-pushed 0e67258 to c71c599
LxYuan0420
left a comment
@Kingwl Looks good. One thing before I approve:
Could you please run these and paste the output in the PR, so we can confirm the tool works end-to-end as documented?
python -m tools.benchmark.attention_benchmark --group small --iters 5 --warmup 2
python -m tools.benchmark.attention_benchmark --mode decode --batch-size 4 --kv-lens 512 --backend v1,v2 --iters 5 --warmup 2
python -m tools.benchmark.attention_benchmark --group small --iters 5 --warmup 2 --output-json /tmp/attention.json
cat /tmp/attention.json
Sure

@Kingwl Could you please run the formatter / lint / type-check locally and push a fix?
@Kingwl Could you please provide any hypothesis on why this happens? This question is not blocking; please feel free to merge. I'm afraid we will have to deprecate v1 soon because it's not compatible with mixed prefilling & decoding. I just don't want to bury this key finding.
Signed-off-by: kingwl <kingwenlu@gmail.com>
Fixed lint/format/typecheck issues with a new commit.
@WindChimeRan WDYT?
BTW, because of the merge of #172, the v1 kernel and its benchmark are now dead code. We need to clean them up in follow-up PRs.


Summary
This PR adds a local Metal attention benchmark under tools/benchmark/attention_benchmark.py.
It supports both preset workloads and fully manual one-off workloads. Presets are defined directly in Python as built-in CASES and GROUPS, with built-in groups such as all, decode, varlen, small, typical, and long.
The benchmark supports num_layers for multi-layer benchmarking and reports per-layer average latency, following the upstream design.
By default, preset runs compare v1, v2, textbook, and sdpa. Passing --backend all also includes sdpa-compute-only as an additional baseline.
The benchmark prints a text table to stdout and also supports --output-json and --output-csv exports.
It also factors shared benchmark/reference helpers into tools/attention_bench_utils.py and adds usage notes to tools/README.md.
Results may vary slightly from run to run.
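The reported metrics (per-layer average latency and throughput) can be derived from a timed run as below. This is a hypothetical sketch of the arithmetic only; the tool's actual column names and helper signatures may differ.

```python
def summarize(total_s, iters, num_layers, q_lens):
    """Derive per-layer latency (ms) and tokens/s from one timed run.

    total_s covers iters iterations, each running a num_layers-deep
    stack over a batch that processes sum(q_lens) query tokens.
    (Illustrative helper, not the tool's real API.)
    """
    per_iter_s = total_s / iters
    return {
        # Average a single layer's share of one iteration, in ms.
        "per_layer_ms": per_iter_s / num_layers * 1e3,
        # Query tokens processed per second of wall time.
        "tokens_per_s": sum(q_lens) / per_iter_s,
    }
```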
Single Layer Metal Attention Benchmark
Multiple Layer Metal Attention Benchmark
Single-layer summary:
- v2 is already the best backend on the main representative cases (decode-typical and decode-long).
- v1 still wins on decode-big-head, so large-head decode remains a clear gap for v2.
- v2 wins varlen-light and varlen-ragged-longtail, while sdpa wins the heavier cases (varlen-typical and varlen-single-long).

Compared with the multi-layer run (num_layers=10):
- v2 still leads on decode-typical and decode-long, and v1 still leads on decode-big-head.
- In the single-layer run the varlen cases split between v2 and sdpa; in the 10-layer run, all current varlen cases shift to textbook.

Fixed #175