[Paged KV] Add paged attention deterministic smoke test #138
LxYuan0420 merged 7 commits into vllm-project:main from
Conversation
Signed-off-by: ran <hzz5361@psu.edu>
Signed-off-by: ran <hzz5361@psu.edu>
@LxYuan0420 requested for review
LxYuan0420
left a comment
Nice direction and the tools/ helper is a good addition.
To recap the main issue I ran into locally: on macOS 15 (HF compat kernel 8968951), this test can flake under the current batched llm.generate(PROMPTS, ...) setup. Specifically, greedy output for the prompt "One plus one equals" sometimes diverges from both golden sets.
For CI stability, I think we should make the execution mode explicit. Would you be open to forcing single-seq execution (LLM(..., max_num_seqs=1)) for this test?
If your intent is to also cover batched behavior, maybe we can add a separate batched smoke test with a weaker assertion (since exact token IDs aren't batch-invariant on Metal today).
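For reference, the exact-match check such a golden smoke test relies on can be sketched as follows. This is a minimal, hypothetical sketch: `matches_golden`, `GOLDEN`, and the placeholder token IDs are illustrative and are not the PR's actual code.

```python
# Hypothetical sketch of a golden-token comparison for a deterministic
# smoke test: greedy output must exactly match one of the recorded
# golden sequences (e.g. MLX inline-cache path or HF paged-KV path).
def matches_golden(token_ids, golden_sets):
    """Return True if token_ids exactly equals any recorded golden sequence."""
    return any(list(token_ids) == list(golden) for golden in golden_sets)

# Placeholder golden IDs, purely illustrative.
GOLDEN = {
    "One plus one equals": [
        [9707, 13],  # e.g. IDs recorded from the MLX inline-cache path
        [9707, 13],  # e.g. IDs recorded from the HF paged-KV path
    ],
}
```

The flake described above is exactly the case where this check fails for a batched run even though the single-seq run passes: the token IDs themselves shift, so no tolerance short of a weaker assertion would mask it.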
Signed-off-by: ran <hzz5361@psu.edu>
Signed-off-by: ran <hzz5361@psu.edu>
Signed-off-by: ran <hzz5361@psu.edu>
Signed-off-by: Yuan Lik Xun <lxyuan0420@gmail.com>
LxYuan0420
left a comment
Thanks for iterating on this. Forcing single-seq execution (LLM(..., max_num_seqs=1)) resolves the macOS 15 flake I hit with batched scheduling.
I also pushed a small maintainer tweak on top to (1) fix the generator script name in comments/usage and (2) make the autouse env fixture respect any user-provided env overrides so the MLX run instructions match actual behavior.
LGTM.
Re batch-invariant determinism: I’m not expecting that in this PR (I only wanted a CI-stable single-seq golden smoke test). Let’s defer any batch-invariant feature/test until we support real continuous batching on Metal.
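The override-respecting env-fixture tweak mentioned above could be sketched like this. The helper and variable names are hypothetical; the PR's actual fixture and env keys may differ.

```python
import os

# Hypothetical sketch: apply default env vars for the test run, but only
# where the user has not already set a value, so documented MLX run
# instructions (including user overrides) keep matching actual behavior.
def apply_default_env(defaults):
    for key, value in defaults.items():
        os.environ.setdefault(key, value)  # no-op if the user already set it
```

In a pytest conftest this logic would typically live inside an autouse fixture, preferably using `monkeypatch.setenv` so the changes are undone after each test.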
Signed-off-by: Yuan Lik Xun <lxyuan0420@gmail.com>


Summary
test_paged_deterministic.py: 5-prompt smoke test using vLLM offline inference (temp=0, greedy) against hardcoded golden token IDs from Qwen3-0.6B on main, covering both the MLX inline cache and HF paged KV cache paths
tools/gen_golden.py: helper to regenerate golden values
Motivation
Prerequisite for the native Metal kernel PR (#136). After inlining the vendored Metal shaders, paged attention output must remain identical to the current HF kernel baseline. This test anchors that.
Test
python -m pytest tests/test_paged_deterministic.py -v -s (paged path by default)
main with HF kernel: 5/5
Relevant Issue & PR
upstream batch invariant feature
Batch invariance is hardware- and kernel-dependent. Supporting this feature is non-trivial on Metal.
output example:
