
[kernel] remove eager kernel due to the new primitive#372

Merged
WindChimeRan merged 2 commits into vllm-project:main from WindChimeRan:cleanup/remove-eager-gqa-binding
May 14, 2026

Conversation

@WindChimeRan
Collaborator

Clean up the eager kernel now that the primitive (#225) works.

The eager kernel is now dead code.

Signed-off-by: ran <hzz5361@psu.edu>
Comment thread tests/test_metal_unified_attention.py
Signed-off-by: ran <hzz5361@psu.edu>
@WindChimeRan WindChimeRan merged commit 8753b8c into vllm-project:main May 14, 2026
5 checks passed
WindChimeRan added a commit to WindChimeRan/vllm-metal that referenced this pull request May 14, 2026
Resolve modify/delete conflict on test_metal_unified_attention.py:
upstream removed the eager kernel test (PR vllm-project#372), we had added prefill
test cases. Accept deletion, migrate prefill-only test parametrizations
to test_primitive_and_donation.py::test_primitive_vs_reference_varlen.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ricky-chaoju pushed a commit that referenced this pull request May 15, 2026
## Summary

After #372 removed the eager `paged_attention_*` bindings,
`PagedAttentionPrimitive::eval_gpu` is the only caller of
`dispatch_paged_attention_v2_online` (and indirectly
`dispatch_paged_attention_tiled`), and it always passes
`from_primitive=true`. The `from_primitive` parameter, the
`!from_primitive` `add_temporary` branches in both dispatchers, and the
`add_paged_attn_temporaries` helper that those branches called are
therefore all dead code.

`vllm_metal/metal/paged_ops.cpp` — 37 lines deleted:

- `from_primitive` parameter removed from
`dispatch_paged_attention_tiled` and
`dispatch_paged_attention_v2_online`
- the `if (!from_primitive)` `add_temporary_compat` branches removed in
both (including the TurboQuant scale/zero/centroid temporaries)
- `add_paged_attn_temporaries` deleted (only those branches called it)
- call sites in `dispatch_paged_attention_v2_online →
dispatch_paged_attention_tiled` and `PagedAttentionPrimitive::eval_gpu →
dispatch_paged_attention_v2_online` updated to drop the argument
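
The shape of this cleanup can be sketched as below. This is a minimal illustration, not the actual `paged_ops.cpp` code: `FakeEncoder`, `dispatch_before`, `dispatch_after`, and `same_behavior` are hypothetical names, and the real dispatchers take Metal/MLX arguments that are omitted here. It only shows why a branch guarded by `!from_primitive` can be deleted once every caller passes `true`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a command encoder that tracks temporaries.
struct FakeEncoder {
    std::vector<std::string> temporaries;
    int dispatches = 0;
};

// Before: the dispatcher registers temporaries only for eager callers.
// Once the sole caller always passes from_primitive=true, the branch is dead.
void dispatch_before(FakeEncoder& enc, const std::string& out, bool from_primitive) {
    if (!from_primitive) {
        enc.temporaries.push_back(out);  // never taken when from_primitive is true
    }
    ++enc.dispatches;
}

// After: the flag and the guarded branch are removed.
// The parameter is kept unnamed only to mirror the call shape.
void dispatch_after(FakeEncoder& enc, const std::string& /*out*/) {
    ++enc.dispatches;
}

// Checks that both versions leave the encoder in the same state
// for the only remaining call pattern (from_primitive=true).
bool same_behavior() {
    FakeEncoder a;
    FakeEncoder b;
    dispatch_before(a, "out", /*from_primitive=*/true);
    dispatch_after(b, "out");
    return a.dispatches == b.dispatches && a.temporaries == b.temporaries;
}
```

Under that assumption, `same_behavior()` returns `true`, which is the sense in which the removal is behavior-preserving for the primitive path.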

## What stays

`dispatch_reshape_and_cache` keeps its `from_primitive` parameter — it
still has a live eager caller via `reshape_and_cache_impl`, which
`vllm_metal/paged_attention_backend/mha.py:47` calls as part of the
Metal-shader warm-up.

Signed-off-by: Bvicii <yizhanhuang2002@gmail.com>