[Paged KV] Add paged attention deterministic smoke test #138
LxYuan0420 merged 7 commits into vllm-project:main from
Conversation
Signed-off-by: ran <hzz5361@psu.edu>
Signed-off-by: ran <hzz5361@psu.edu>
@LxYuan0420 requested for review
LxYuan0420
left a comment
Nice direction and the tools/ helper is a good addition.
To recap the main issue I ran into locally: on macOS 15 (HF compat kernel 8968951), this test can flake under the current batched llm.generate(PROMPTS, ...) setup. Specifically, greedy output for the prompt "One plus one equals" sometimes diverges from both golden sets.
For CI stability, I think we should make the execution mode explicit. Would you be open to forcing single-seq execution (LLM(..., max_num_seqs=1)) for this test?
If your intent is to also cover batched behavior, maybe we can add a separate batched smoke test with a weaker assertion (since exact token IDs aren't batch-invariant on Metal today).
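For reference, the exact-match check such a golden smoke test relies on can be sketched as follows. This is a minimal, hypothetical sketch: `matches_golden`, `GOLDEN`, and the placeholder token IDs are illustrative and are not the PR's actual code.

```python
# Hypothetical sketch of a golden-token comparison for a deterministic
# smoke test: greedy output must exactly match one of the recorded
# golden sequences (e.g. MLX inline-cache path or HF paged-KV path).
def matches_golden(token_ids, golden_sets):
    """Return True if token_ids exactly equals any recorded golden sequence."""
    return any(list(token_ids) == list(golden) for golden in golden_sets)

# Placeholder golden IDs, purely illustrative.
GOLDEN = {
    "One plus one equals": [
        [9707, 13],  # e.g. IDs recorded from the MLX inline-cache path
        [9707, 13],  # e.g. IDs recorded from the HF paged-KV path
    ],
}
```

The flake described above is exactly the case where this check fails for a batched run even though the single-seq run passes: the token IDs themselves shift, so no tolerance short of a weaker assertion would mask it.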
Signed-off-by: ran <hzz5361@psu.edu>
Signed-off-by: ran <hzz5361@psu.edu>
Signed-off-by: ran <hzz5361@psu.edu>
Signed-off-by: Yuan Lik Xun <lxyuan0420@gmail.com>
LxYuan0420
left a comment
Thanks for iterating on this. Forcing single-seq execution (LLM(..., max_num_seqs=1)) resolves the macOS 15 flake I hit with batched scheduling.
I also pushed a small maintainer tweak on top to (1) fix the generator script name in comments/usage and (2) make the autouse env fixture respect any user-provided env overrides so the MLX run instructions match actual behavior.
LGTM.
Re batch-invariant determinism: I’m not expecting that in this PR (I only wanted a CI-stable single-seq golden smoke test). Let’s defer any batch-invariant feature/test until we support real continuous batching on Metal.
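The override-respecting env-fixture tweak mentioned above could be sketched like this. The helper and variable names are hypothetical; the PR's actual fixture and env keys may differ.

```python
import os

# Hypothetical sketch: apply default env vars for the test run, but only
# where the user has not already set a value, so documented MLX run
# instructions (including user overrides) keep matching actual behavior.
def apply_default_env(defaults):
    for key, value in defaults.items():
        os.environ.setdefault(key, value)  # no-op if the user already set it
```

In a pytest conftest this logic would typically live inside an autouse fixture, preferably using `monkeypatch.setenv` so the changes are undone after each test.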
Signed-off-by: Yuan Lik Xun <lxyuan0420@gmail.com>


Summary
test_paged_deterministic.py: 5-prompt smoke test using vLLM offline inference (temp=0, greedy) against hardcoded golden token IDs from Qwen3-0.6B on main, covering both the MLX inline cache and HF paged KV cache paths
tools/gen_golden.py: helper to regenerate golden values
Motivation
Prerequisite for the native Metal kernel PR (#136). After inlining the vendored Metal shaders, paged attention output must remain identical to the current HF kernel baseline. This test anchors that.
Test
python -m pytest tests/test_paged_deterministic.py -v -s (paged path by default)
main with HF kernel: 5/5
Relevant Issue & PR
upstream batch invariant feature
Batch invariance is hardware- and kernel-dependent. Supporting this feature is non-trivial on Metal.
output example:
