
Fuse reshape_and_cache + paged_attention into a single MLX primitive#225

Merged
ericcurtin merged 8 commits into vllm-project:main from WindChimeRan:feat/fused-reshape-attention-primitive
Apr 6, 2026
Conversation

@WindChimeRan
Collaborator

@WindChimeRan WindChimeRan commented Apr 3, 2026

(Figure: bench_primitive_comparison benchmark chart)

Builds on the spike in #209. Related: #188.

### What

Replace the eager `reshape_and_cache` + `paged_attention_v2_online` dispatch with:

1. **MLX-native scatter** for cache writes - pure functional, graph-tracked, donation-eligible. Replaces the custom `reshape_and_cache` Metal kernel in the production path.
2. **`PagedAttentionPrimitive`** for attention - a read-only primitive that dispatches the paged attention Metal kernel lazily.

Both operations are fully lazy. No per-layer `mx.eval` or `mx.synchronize`. The entire 28-layer model builds one lazy graph, evaluated once by the model runner.

**Bug fix**: custom primitives must not call `add_temporary` inside `eval_gpu`. MLX's `add_temporary` removes buffer pointers from the command encoder's fence tracking, breaking cross-command-buffer synchronization when the graph is evaluated lazily. The fix: `from_primitive=true` skips all `add_temporary` calls. MLX's evaluator already manages array lifetimes via the completion handler. This matches the pattern in MLX's official `axpby` extension example.

### Why

The original eager path does **1 `mx.eval` + 1 `mx.synchronize`** per layer, each a CPU-GPU sync point. With 28 layers, that is 56 sync points per decode step.

The primitive path eliminates all of them. The scatter participates in MLX's lazy graph, and the attention primitive dispatches correctly across command buffer boundaries thanks to the `add_temporary` fix.
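The sync-point accounting above can be sketched in plain Python (a toy model, not MLX; `decode_step_eager` and `decode_step_lazy` are illustrative names, not functions from this PR):

```python
# "sync" stands in for one host<->GPU round trip (an mx.eval or an
# mx.synchronize). The eager path pays two per layer; the lazy path
# only records graph nodes and pays one evaluation per decode step.

def decode_step_eager(num_layers: int) -> int:
    syncs = 0
    for _ in range(num_layers):
        syncs += 1  # mx.eval after the in-place cache write
        syncs += 1  # mx.synchronize before the next layer
    return syncs

def decode_step_lazy(num_layers: int) -> int:
    graph = [lambda l=layer: f"layer-{l}" for layer in range(num_layers)]
    for thunk in graph:   # the model runner evaluates the whole graph...
        thunk()
    return 1              # ...in a single evaluation, the only sync point

assert decode_step_eager(28) == 56  # matches the 56 sync points above
assert decode_step_lazy(28) == 1
```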

### Test

6/6 deterministic golden tests (bit-exact). 362/362 broader suite.

### Future Work

- Wire partitioned attention (#181) into the primitive path
- Clean up dead eager code paths (`reshape_and_cache`, `metal_unified_attention`, etc.)

### Benchmark (sonnet 1024+128, 100 prompts, concurrency 8, 5 warmups)

| Metric | main | this PR | Change |
|---|---:|---:|---:|
| Duration (s) | 288.71 | 257.01 | **-11.0%** |
| Output tok/s | 44.34 | 49.80 | **+12.3%** |
| Total tok/s | 394.96 | 443.68 | **+12.3%** |
| Mean TTFT (ms) | 4810.26 | 4771.41 | -0.8% |
| Mean TPOT (ms) | 139.98 | 121.47 | **-13.2%** |
| P99 TPOT (ms) | 169.16 | 149.75 | **-11.5%** |
Full benchmark output

**main (paged attention, eager dispatch)**

```
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Maximum request concurrency:             8
Request rate configured (RPS):           10.00
Benchmark duration (s):                  288.71
Total input tokens:                      101230
Total generated tokens:                  12800
Request throughput (req/s):              0.35
Output token throughput (tok/s):         44.34
Peak output token throughput (tok/s):    112.00
Peak concurrent requests:                11.00
Total token throughput (tok/s):          394.96
---------------Time to First Token----------------
Mean TTFT (ms):                          4810.26
Median TTFT (ms):                        4865.42
P99 TTFT (ms):                           10145.45
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          139.98
Median TPOT (ms):                        139.36
P99 TPOT (ms):                           169.16
---------------Inter-token Latency----------------
Mean ITL (ms):                           139.98
Median ITL (ms):                         79.76
P99 ITL (ms):                            2661.48
==================================================
```

**this PR (MLX scatter + paged attention primitive, fully lazy)**

```
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Maximum request concurrency:             8
Request rate configured (RPS):           10.00
Benchmark duration (s):                  257.01
Total input tokens:                      101230
Total generated tokens:                  12800
Request throughput (req/s):              0.39
Output token throughput (tok/s):         49.80
Peak output token throughput (tok/s):    120.00
Peak concurrent requests:                11.00
Total token throughput (tok/s):          443.68
---------------Time to First Token----------------
Mean TTFT (ms):                          4771.41
Median TTFT (ms):                        5165.51
P99 TTFT (ms):                           9396.19
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          121.47
Median TPOT (ms):                        118.78
P99 TPOT (ms):                           149.75
---------------Inter-token Latency----------------
Mean ITL (ms):                           121.47
Median ITL (ms):                         70.34
P99 ITL (ms):                            2660.84
==================================================
```
Benchmark config

- Model: Qwen/Qwen3-0.6B
- Dataset: sonnet (1024 input + 128 output)
- Prompts: 100, rate: 10, concurrency: 8, warmups: 5
- Memory fraction: 0.3 (paged path)
- Hardware: Apple M1 Pro, 32 GB RAM

Signed-off-by: ran <hzz5361@psu.edu>
@WindChimeRan WindChimeRan force-pushed the feat/fused-reshape-attention-primitive branch from 96bd651 to 7a2ec93 Compare April 3, 2026 04:45
@WindChimeRan WindChimeRan changed the title Fused Page Attention Primitive to save CPU-GPU sync Fuse reshape_and_cache + paged_attention into a single MLX primitive Apr 3, 2026
@WindChimeRan
Collaborator Author

WindChimeRan commented Apr 3, 2026

Design: functional cache writes via MLX scatter

From a functional programming semantics perspective, `reshape_and_cache` doesn't mutate the "whole cache." It writes to specific slots determined by `slot_mapping`. Those slots (`block_idx`, `block_offset`) are disjoint from all other slots - no two tokens write to the same slot. Conceptually:

```python
new_cache = old_cache                       # copy the whole thing
for i in range(len(slot_mapping)):
    new_cache[slot_mapping[i]] = new_kv[i]  # write to specific slots
```

This is exactly what `mx.scatter` or slice assignment does - a pure functional operation. The result is a "new" array. MLX knows how to handle this: when the old cache reference has `use_count == 1` (nobody else holds it), MLX can donate the buffer - the "copy" reuses the same physical memory. Zero allocation, zero memcpy.

The original `reshape_and_cache` Metal kernel was an optimization that mutated the cache buffer in place. This made it a side effect invisible to MLX's computation graph, forcing per-layer `mx.eval` + `mx.synchronize` to ensure correctness. By replacing it with MLX's native scatter, the cache write becomes a proper graph node. MLX tracks the dependency, handles buffer donation, and sequences it correctly across command buffer boundaries - no explicit sync needed.
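The disjoint-slot argument can be checked with a minimal NumPy sketch (NumPy stands in for `mx.scatter`; the shapes and `slot_mapping` values are made up for illustration):

```python
import numpy as np

# A flat KV cache of num_slots slots, each holding one token's K or V.
num_slots, head_dim = 16, 4
old_cache = np.zeros((num_slots, head_dim), dtype=np.float32)

# Three new tokens land in the disjoint slots chosen by slot_mapping.
slot_mapping = np.array([5, 2, 11])
new_kv = np.arange(3 * head_dim, dtype=np.float32).reshape(3, head_dim)

# Functional write: copy the cache, then scatter into the selected slots.
# (In MLX the copy is elided by buffer donation when use_count == 1.)
new_cache = old_cache.copy()
new_cache[slot_mapping] = new_kv

# An in-place kernel would produce exactly the same contents.
old_cache[slot_mapping] = new_kv
assert np.array_equal(new_cache, old_cache)

# All slots outside slot_mapping are untouched.
untouched = np.delete(np.arange(num_slots), slot_mapping)
assert not new_cache[untouched].any()
```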

@WindChimeRan WindChimeRan marked this pull request as ready for review April 3, 2026 18:11
@WindChimeRan
Collaborator Author

WindChimeRan commented Apr 3, 2026

@ericcurtin @Kingwl

This is the last piece of the paged attention puzzle! Everything now lives in a single MLX lazy graph, with no unnecessary syncs.

@ericcurtin
Collaborator

- Cache rebind may break shared references (prefix caching)
- No test coverage for the primitive path
- `mx.array(0)` + `overwrite_descriptor` is fragile
- `UnaryPrimitive` with 6 inputs is semantically misleading

@WindChimeRan WindChimeRan marked this pull request as draft April 6, 2026 05:23
@WindChimeRan
Collaborator Author

@ericcurtin

> `UnaryPrimitive` with 6 inputs is semantically misleading

This comes from MLX itself: `UnaryPrimitive` means single-output, not single-input.

@WindChimeRan
Collaborator Author

> `mx.array(0)` + `overwrite_descriptor` is fragile

Agreed, it is fragile, but I don't have a better approach for now. The new test should catch it if an upstream MLX update breaks it.

@WindChimeRan
Collaborator Author

> Cache rebind may break shared references (prefix caching)

Prefix caching is applied at the block-index level, not the array-reference level, so this should be fine.
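That distinction can be illustrated with a tiny sketch (hypothetical structures, not the actual scheduler code):

```python
# Block tables store *indices* into the paged cache, and prefix sharing
# happens at that index level. Rebinding the cache array to a new
# (donated) buffer leaves the tables untouched, so sharing survives.
block_tables = {
    "seq_a": [0, 1, 2],  # blocks 0 and 1 hold the shared prefix
    "seq_b": [0, 1, 3],
}

cache_v1 = ["blk0", "blk1", "blk2", "blk3"]
cache_v2 = list(cache_v1)  # "rebound" cache after a functional write

shared = sorted(set(block_tables["seq_a"]) & set(block_tables["seq_b"]))
assert shared == [0, 1]
# Both cache bindings resolve the shared blocks to the same contents.
assert [cache_v2[i] for i in shared] == [cache_v1[i] for i in shared]
```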

@WindChimeRan WindChimeRan force-pushed the feat/fused-reshape-attention-primitive branch 2 times, most recently from 3416d33 to fc94266 Compare April 6, 2026 06:50
WindChimeRan and others added 3 commits April 6, 2026 01:53
Resolve conflict in paged_ops.cpp: keep both paged_attention_primitive
(ours) and gdn_linear_attention (upstream vllm-project#226) bindings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: ran <hzz5361@psu.edu>
@WindChimeRan WindChimeRan force-pushed the feat/fused-reshape-attention-primitive branch from fc94266 to 2f691eb Compare April 6, 2026 06:54
@WindChimeRan WindChimeRan marked this pull request as ready for review April 6, 2026 06:56
@WindChimeRan
Collaborator Author

@ericcurtin added some tests. Requesting review.

@ericcurtin ericcurtin merged commit f518143 into vllm-project:main Apr 6, 2026
5 checks passed
Alex-ai-future pushed a commit to Alex-ai-future/vllm-metal that referenced this pull request Apr 8, 2026